练手练到阅文集团作家中心了，python crawlspider 二维抓取学习

2021 年 12 月 10 日
本文字数：2555 字
阅读完需：约 8 分钟

本篇博客学习使用 CrawlSpider 进行二维抓取。

目标站点分析

本次要采集的目标站点为：阅文集团作家中心

分页地址具备一定规则，具体如下：

https://write.qq.com/portal/article?filterType=0&page={页码}

复制代码

由于本文重点学习内容为简单操作 scrapy 实现爬虫，所以目标详情页仅提取标题即可。

知识说明

本篇博客生成爬虫文件，使用 scrapy genspider -t crawl yw write.qq.com，默认生成的代码如下所示：

import scrapyfrom scrapy.linkextractors import LinkExtractorfrom scrapy.spiders import CrawlSpider, Rule

class CbSpider(CrawlSpider):    name = 'yw'    allowed_domains = ['write.qq.com']    start_urls = ['https://write.qq.com/portal/article?filterType=0&page=1']
    rules = (        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),    )
    def parse_item(self, response):        item = {}        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()        #item['name'] = response.xpath('//div[@id="name"]').get()        #item['description'] = response.xpath('//div[@id="description"]').get()        return item

复制代码

其中 CrawlSpider 继承自 Spider ，派生该类的原因是为了简化爬取编码，其中增加了类属性 rules ，其值为Rule 对象的集合元组，用于匹配目标网站并排除干扰。

rules = (    Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),)

复制代码

** Rule 类的构造函数原型如下**

Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)

复制代码

参数说明如下：

link_extractor：LinkExtractor 对象，定义了如何从爬取到的页面提取链接；
callback：回调函数，也可以是回调函数的字符串名。接收 Response 对象作为参数，返回包含 Item 或者 Request 对象列表；
cb_kwargs：用于作为 **kwargs 参数，传递给 callback；
follow：布尔值，默认是 False，用于判断是否继续从该页面提取链接；
process_links：指定该 spider 中哪个函数将会被调用，从 link_extractor 中获取到链接列表后悔调用该函数，该方法主要用来过滤 url；
process_request：回调函数，也可以是回调函数的字符串名。用来过滤 Request ，该规则提取到每个 Request 时都会调用该函数。

** LinkExtractor 类构造函数原型如下**

LinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'),              attrs=('href',), canonicalize=False, unique=True, process_value=None, deny_extensions=None,              restrict_css=(), strip=True, restrict_text=None,              )

复制代码

其中比较重要的参数如下所示：

allow：正则表达式，满足的会被提取，默认为空，表示提取全部超链接；
deny：正则表达式，不被提取的内容；
allow_domains：允许的 domains；
deny_domains：不允许的 domains；
restrict_xpaths：xpath 表达式
tags：提取的标签；
attrs：默认属性值；
restrict_css：css 表达式提取；
restrict_text：文本提取。

编码时间

在 rules 变量中，设置两个 URL 提取规则。

# URL 提取规则rules = (    Rule(LinkExtractor(allow=r'.*/portal/content\?caid=\d+&feedType=2&lcid=\d+$'), callback="parse_item"),    # 寻找下一页 url 地址    Rule(LinkExtractor(restrict_xpaths="//a[@title='下一页']"), follow=True),)

复制代码

代码重点在寻找下一页部分，使用的是 restrict_xpaths 方法，即基于 xpath 提取，原因是由于使用正则表达式提取下一页地址，写起来略微繁琐。第一个 Rule 规则是获取内容详情页数据，回调函数 callback 是 parse_item。

完整代码如下所示：

import scrapyfrom scrapy.linkextractors import LinkExtractorfrom scrapy.spiders import CrawlSpider, Rule

class CbSpider(CrawlSpider):    name = 'yw'    allowed_domains = ['write.qq.com']    start_urls = ['https://write.qq.com/portal/article?filterType=0&page=1']    # URL 提取规则    rules = (        Rule(LinkExtractor(allow=r'.*/portal/content\?caid=\d+&feedType=2&lcid=\d+$'), callback="parse_item"),        # 寻找下一页 url 地址        Rule(LinkExtractor(restrict_xpaths="//a[@title='下一页']"), follow=True),    )
    def parse_item(self, response):        print("测试输出")        print(response.url)        title = response.css('title::text').extract()[0]        print(title)        item = {}        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()        # item['name'] = response.xpath('//div[@id="name"]').get()        # item['description'] = response.xpath('//div[@id="description"]').get()        return item

复制代码

编写完毕，在 settings.py 文件中修改如下内容：

ROBOTSTXT_OBEY = False # 关闭 robots.txt 文件请求LOG_LEVEL = 'WARNING' # 调高日志输出级别

复制代码

最后得到的结果如下所示。

知识再次补充

在 CrawlSpider 的派生类中，你还可以重写如下方法：

parse_start_url()：在采集前对 start_urls 进行覆盖修改；
process_results()：数据返回时进行再次加工，该函数需要配合 callback 参数设定的函数使用。

def parse_start_url(self, response):    print("---process_results---")    yield scrapy.Request('https://write.qq.com/portal/article?filterType=0&page=1')
def process_results(self, response, results):    print("---process_results---")    print(results)
def parse_item(self, response):    print("---parse_item---")    print(response.url)    title = response.css('title::text').extract()[0].strip()    item = {}    item["title"] = title    yield item

复制代码

写在后面

今天是持续写作的第 248 / 365 天。期待 关注，点赞、评论、收藏。

发布于: 2 小时前阅读数: 5

原文链接:【http://xie.infoq.cn/article/648f8ca458704ba377973db42】。文章转载请联系作者。

梦想橡皮擦

关注

爬虫 100 例作者，蓝桥签约作者，博客专家 2021.02.06 加入

6 年产品经理+教学经验，3 年互联网项目管理经验；互联网资深爱好者；沉迷各种技术无法自拔，导致年龄被困在 25 岁； CSDN 爬虫 100 例作者。个人公众号“梦想橡皮擦”。

发布

暂无评论

创作场景

练手练到阅文集团作家中心了，python crawlspider 二维抓取学习

目标站点分析

知识说明

编码时间

知识再次补充

写在后面

梦想橡皮擦

评论