How to Scrape a Website's Sitemap with Scrapy in Python

yipeiwu_com, 6 years ago (2020-03-06), Python crawlers

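This article walks through an example of using Scrapy in Python to read a website's sitemap and crawl the pages it lists. It is shared here for reference. The details are as follows: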

import re

import scrapy


class PageItem(scrapy.Item):
    # Placeholder item; replace this with the fields you actually need
    divText = scrapy.Field()


class SitemapSpider(scrapy.Spider):
    name = "SitemapSpider"
    start_urls = ["http://www.domain.com/sitemap.xml"]

    def parse(self, response):
        # Pull the content of every <loc> element out of the sitemap XML
        nodename = "loc"
        text = response.text
        r = re.compile(r"(<%s[\s>])(.*?)(</%s>)" % (nodename, nodename), re.DOTALL)
        for match in r.finditer(text):
            url = match.group(2).strip()
            # Schedule each listed page and hand the response to parse_page
            yield scrapy.Request(url, callback=self.parse_page)

    def parse_page(self, response):
        # Mock item
        blah = PageItem()
        # Do all your page parsing here and select the elements you want
        blah["divText"] = response.xpath("//div/text()").get()
        yield blah
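
The `<loc>` extraction step in `parse` can be tried on its own, without running a crawl. The following is a minimal sketch using only the standard library. The names `extract_locs`, `LOC_PATTERN`, and `SAMPLE_SITEMAP` are hypothetical and not part of the original example, and the sample XML is made up for illustration. It also assumes that URLs in a sitemap are entity-escaped (for example `&amp;` for `&`), so the sketch decodes them before returning.

```python
import html
import re

# Same pattern as in the spider's parse() method, fixed to the <loc> tag.
LOC_PATTERN = re.compile(r"(<loc[\s>])(.*?)(</loc>)", re.DOTALL)


def extract_locs(sitemap_xml):
    """Return the URLs found in the <loc> elements of a sitemap document."""
    urls = []
    for match in LOC_PATTERN.finditer(sitemap_xml):
        # Strip surrounding whitespace and decode entities such as &amp;
        urls.append(html.unescape(match.group(2).strip()))
    return urls


# Hypothetical sample input, used only to illustrate the extraction.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.domain.com/</loc>
  </url>
  <url>
    <loc>
      http://www.domain.com/page?id=1&amp;lang=en
    </loc>
  </url>
</urlset>
"""

if __name__ == "__main__":
    for url in extract_locs(SAMPLE_SITEMAP):
        print(url)
```

Stripping whitespace matters because some sitemaps put the URL on its own line inside `<loc>`, and passing such a string to a request unchanged would likely fail.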

I hope this article is helpful for your Python programming.

© YiPeiWu.com 【宜配屋】粤ICP备17031333号

Powered By Z-BlogPHP. Theme by TOYEAN.
