Avoiding Duplicate Crawling with a Custom Scrapy Middleware in Python

yipeiwu_com · 6 years ago · Python Basics

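This article shows, by example, how to write a custom Scrapy spider middleware in Python that avoids crawling the same item pages twice. It is shared here for reference. The code is as follows: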

import logging

from scrapy.http import Request
from scrapy.item import Item
from scrapy.utils.request import request_fingerprint

from myproject.items import MyItem

# scrapy.log is gone from current Scrapy releases; use the standard logging module.
logger = logging.getLogger(__name__)


class IgnoreVisitedItems(object):
    """Spider middleware that skips item pages which were already visited.

    Requests to be filtered must carry a meta['filter_visited'] flag.
    They may also set meta['visited_id'] to identify the page. The id
    defaults to the request fingerprint, although the item id is more
    robust if you already know it beforehand.
    """

    FILTER_VISITED = 'filter_visited'
    VISITED_ID = 'visited_id'
    CONTEXT_KEY = 'visited_ids'

    def process_spider_output(self, response, result, spider):
        # Keep the visited ids on the spider so they persist between calls.
        # (A throwaway default dict would forget every id immediately.)
        context = getattr(spider, 'context', None)
        if context is None:
            context = spider.context = {}
        visited_ids = context.setdefault(self.CONTEXT_KEY, {})
        ret = []
        for x in result:
            visited = False
            if isinstance(x, Request):
                # Only requests that opted in through the meta flag are checked.
                if self.FILTER_VISITED in x.meta:
                    visit_id = self._visited_id(x)
                    if visit_id in visited_ids:
                        logger.info("Ignoring already visited: %s", x.url,
                                    extra={'spider': spider})
                        visited = True
            elif isinstance(x, Item):
                # An item was scraped from this response: remember the page.
                visit_id = self._visited_id(response.request)
                if visit_id:
                    visited_ids[visit_id] = True
                    x['visit_id'] = visit_id
                    x['visit_status'] = 'new'
            if visited:
                # Replace the dropped request with a placeholder item.
                ret.append(MyItem(visit_id=visit_id, visit_status='old'))
            else:
                ret.append(x)
        return ret

    def _visited_id(self, request):
        # Prefer the explicit id; fall back to the request fingerprint.
        return request.meta.get(self.VISITED_ID) or request_fingerprint(request)
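
The listing does not show how the middleware is switched on or how a request opts in, so here is a minimal sketch of both. The module path `myproject.middlewares`, the priority `560`, and the helper `item_request_meta` are assumptions made for illustration, not part of the original code; only the `SPIDER_MIDDLEWARES` setting and the two meta keys come from Scrapy and from the class above.

```python
# settings.py (sketch): register the spider middleware.
# The module path and the priority value 560 are assumed, adjust to your project.
SPIDER_MIDDLEWARES = {
    "myproject.middlewares.IgnoreVisitedItems": 560,
}


# Hypothetical helper for spider callbacks: builds the meta dict that
# makes IgnoreVisitedItems check a request.
def item_request_meta(item_id):
    return {
        "filter_visited": True,             # IgnoreVisitedItems.FILTER_VISITED
        "visited_id": "item-%s" % item_id,  # IgnoreVisitedItems.VISITED_ID
    }


# Inside a callback one would then write something like:
#   yield Request(url, meta=item_request_meta(42), callback=self.parse_item)
```

Because the middleware writes `visit_id` and `visit_status` into every item it sees, the item class (`MyItem` in the listing) needs to declare both fields. Leaving `visited_id` out of the meta dict is also possible, in which case the request fingerprint is used as the id.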

Hopefully this article is helpful for your Python programming.
