宜配屋

装饰器在 python 中用的相当广泛，如果你用过 python 的一些 web 框架，那么一定对其中的 “ route() 装饰器” 不陌生，今天咱们再看一个具体的案例。

咱们来模拟一个场景，需要你去抓去一个页面，然后这个页面有好多url也要分别去抓取，而进入这些子url后，还有数据要抓取。简单点，我们就按照三层来看，那我们的代码就是如下：

def func_top(url):
  data_dict= {}
 
  #在页面上获取到子url
  sub_urls = xxxx
 
  data_list = []
  for it in sub_urls:
    data_list.append(func_sub(it))
 
  data_dict['data'] = data_list
 
  return data_dict
 
def func_sub(url):
  data_dict= {}
 
  #在页面上获取到子url
  bottom_urls = xxxx
 
  data_list = []
  for it in bottom_urls:
    data_list.append(func_bottom(it))
 
  data_dict['data'] = data_list
 
  return data_dict
 
def func_bottom(url):
  #获取数据
  data = xxxx
  return data

func_top是上层页面的处理函数，func_sub是子页面的处理函数，func_bottom是最深层页面的处理函数，func_top会在取到子页面url后遍历调用func_sub,func_sub也是同样。
如果正常情况下，这样确实已经满足需求了，但是偏偏这个你要抓取的网站可能极不稳定，经常链接不上，导致数据拿不到。
于是这个时候你有两个选择:
1.遇到错误就停止，之后重新从断掉的位置开始重新跑
2.遇到错误继续，但是要在之后重新跑一遍，这个时候已经有的数据不希望再去网站拉一次，而只去拉没有取到的数据
对第一种方案基本无法实现，因为如果别人网站的url调整顺序，那么你记录的位置就无效了。那么只有第二种方案，说白了，就是要把已经拿到的数据cache下来，等需要的时候，直接从cache里面取。
OK，目标已经有了，怎么实现呢？
如果是在C++中的，这是个很麻烦的事情，而且写出来的代码必定丑陋无比，然而庆幸的是，我们用的是python，而python对函数有装饰器。
所以实现方案也就有了:
定义一个装饰器，如果之前取到数据，就直接取cache的数据；如果之前没有取到，那么就从网站拉取，并且存入cache中.
代码如下:

import os
import hashlib
 
def deco_args_recent_cache(category='dumps'):
  '''
  装饰器，返回最新cache的数据
  '''
  def deco_recent_cache(func):
    def func_wrapper(*args, **kargs):
      sig = _mk_cache_sig(*args, **kargs)
      data = _get_recent_cache(category, func.__name__, sig)
      if data is not None:
        return data
 
      data = func(*args, **kargs)
      if data is not None:
        _set_recent_cache(category, func.__name__, sig, data)
      return data
 
    return func_wrapper
 
  return deco_recent_cache
 
def _mk_cache_sig(*args, **kargs):
  '''
  通过传入参数，生成唯一标识
  '''
  src_data = repr(args) + repr(kargs)
  m = hashlib.md5(src_data)
  sig = m.hexdigest()
  return sig
 
def _get_recent_cache(category, func_name, sig):
  full_file_path = '%s/%s/%s' % (category, func_name, sig)
  if os.path.isfile(full_file_path):
    return eval(file(full_file_path,'r').read())
  else:
    return None
 
def _set_recent_cache(category, func_name, sig, data):
  full_dir_path = '%s/%s' % (category, func_name)
  if not os.path.isdir(full_dir_path):
    os.makedirs(full_dir_path)
 
  full_file_path = '%s/%s/%s' % (category, func_name, sig)
  f = file(full_file_path, 'w+')
  f.write(repr(data))
  f.close()

然后，我们只需要在每个func_top,func_sub,func_bottom都加上deco_args_recent_cache这个装饰器即可~~
搞定！这样做最大的好处在于，因为top,sub,bottom，每一层都会dump数据，所以比如某个sub层数据dump之后，是根本不会走到他所对应的bottom层的，减少了大量的开销！
OK，就这样~ 人生苦短，我用python！

注：

python3 已经原生支持了这种功能！链接如下:

http://docs.python.org/py3k/whatsnew/3.2.html#functools

进一步探究Python的装饰器的运用

相关文章

使用Python做定时任务及时了解互联网动态

使用Python编写vim插件的简单示例

Python使用post及get方式提交数据的实例

python实现简单日志记录库glog的使用

浅谈Python数据类型判断及列表脚本操作

© YiPeiWu.com 【宜配屋】粤ICP备17031333号

Powered By Z-BlogPHP. Theme by TOYEAN.

宜配屋