Python自定义scrapy中间模块避免重复采集的方法

yipeiwu_com5年前Python基础

本文实例讲述了Python自定义scrapy中间模块避免重复采集的方法。分享给大家供大家参考。具体如下:

from scrapy import log
from scrapy.http import Request
from scrapy.item import BaseItem
from scrapy.utils.request import request_fingerprint
from myproject.items import MyItem
class IgnoreVisitedItems(object):
  """Middleware to ignore re-visiting item pages if they
  were already visited before. 
  The requests to be filtered by have a meta['filter_visited']
  flag enabled and optionally define an id to use 
  for identifying them, which defaults the request fingerprint,
  although you'd want to use the item id,
  if you already have it beforehand to make it more robust.
  """
  FILTER_VISITED = 'filter_visited'
  VISITED_ID = 'visited_id'
  CONTEXT_KEY = 'visited_ids'
  def process_spider_output(self, response, result, spider):
    context = getattr(spider, 'context', {})
    visited_ids = context.setdefault(self.CONTEXT_KEY, {})
    ret = []
    for x in result:
      visited = False
      if isinstance(x, Request):
        if self.FILTER_VISITED in x.meta:
          visit_id = self._visited_id(x)
          if visit_id in visited_ids:
            log.msg("Ignoring already visited: %s" % x.url,
                level=log.INFO, spider=spider)
            visited = True
      elif isinstance(x, BaseItem):
        visit_id = self._visited_id(response.request)
        if visit_id:
          visited_ids[visit_id] = True
          x['visit_id'] = visit_id
          x['visit_status'] = 'new'
      if visited:
        ret.append(MyItem(visit_id=visit_id, visit_status='old'))
      else:
        ret.append(x)
    return ret
  def _visited_id(self, request):
    return request.meta.get(self.VISITED_ID) or request_fingerprint(request)

希望本文所述对大家的Python程序设计有所帮助。

相关文章

python使用pandas处理excel文件转为csv文件的方法示例

由于客户提供的是excel文件,在使用时期望使用csv文件格式,且对某些字段内容需要做一些处理,如从某个字段中固定的几位抽取出来,独立作为一个字段等,下面记录下使用acaconda处理的...

python挖矿算力测试程序详解

python挖矿算力测试程序详解

谈到比特币,我们都知道挖矿,有些人并不太明白挖矿的含义。这里的挖矿其实就是哈希的碰撞,举个简单例子: import hashlib x = 11 y = 1 #这里可以调节挖矿难度,...

解决pip install的时候报错timed out的问题

安装包的时候报错,执行:pip install pyinstaller 问题: File "c:\python\python35\lib\site-packages\pip\_ven...

python实现ftp客户端示例分享

复制代码 代码如下:#!/usr/bin/python#coding:utf-8#write:JACK#info:ftp exampleimport ftplib, socket, os...

使用Python处理Excel表格的简单方法

使用Python处理Excel表格的简单方法

Excel 中的每一个单元,都会有这些属性:颜色(colors)、number formatting、字体(fonts)、边界(borders)、alignment、模式(pattern...