Crawler 120 Examples, No. 17: Collecting Witty Sentences with Object-Oriented Python

梦想橡皮擦

Published: 4 hours ago

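After collecting these 7000+ sentences, it turns out many of them are jokes with a twist, e.g.: If I bring an umbrella, it's sunny; if I don't, it rains.

Target site analysis

The target site is 学句子网 (xuejuzi.cn), and the target address is http://www.xuejuzi.cn/gaoxiao/. The first step is to collect the detail-page links from the list page (marked with a red box in the original screenshot).

List pages are paginated as follows; only the first page needs special handling.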

http://www.xuejuzi.cn/gaoxiao
http://www.xuejuzi.cn/gaoxiao/2.html
http://www.xuejuzi.cn/gaoxiao/3.html
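
The pagination rule above can be sketched as a small helper. The name `build_page_url` is hypothetical and does not appear in the article's code; only the URL pattern comes from the article, and the trailing slash on the first page follows the address given in the site analysis.

```python
BASE_URL = "http://www.xuejuzi.cn/gaoxiao/"


def build_page_url(page: int) -> str:
    """Return the list-page URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page must be >= 1")
    # The first page has no numeric suffix; every later page is "<n>.html".
    if page == 1:
        return BASE_URL
    return f"{BASE_URL}{page}.html"


# Build the first three list-page URLs.
urls = [build_page_url(n) for n in range(1, 4)]
```

Keeping the first-page special case in one place means the rest of the crawler can loop over plain page numbers.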

Because the page contains a 末页 (last page) link, the total page count can be read from the page itself.
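
As a rough sketch of that idea, the page number can be pulled out of the link's href with a regular expression. The function name `parse_last_page` and the sample hrefs are assumptions for illustration; the real href format on the site may differ, so the helper accepts a bare file name, a path, or a full URL and falls back to 1 when no number is found.

```python
import re


def parse_last_page(href: str) -> int:
    """Extract the page number from a link ending in '<n>.html'.

    Returns 1 when the href carries no page number.
    """
    match = re.search(r"(\d+)\.html$", href)
    return int(match.group(1)) if match else 1


# Hypothetical href values, not taken from the live site.
total_pages = parse_last_page("/gaoxiao/35.html")
```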

Extracting data from the detail page is also straightforward: the target text sits in p tags.
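
To illustrate what "text inside p tags" means, here is a minimal sketch that uses only the standard library's `html.parser` instead of the lxml XPath used in the article. The class name `ContentParagraphs`, the function `extract_sentences`, and the sample HTML are made up for the example. The sketch collects every p inside a div with class `content`, which is slightly looser than an XPath that matches only direct children.

```python
from html.parser import HTMLParser


class ContentParagraphs(HTMLParser):
    """Collect the text of <p> elements inside <div class="content">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0  # > 0 while inside the content div
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            if self.div_depth > 0:
                self.div_depth += 1  # nested div inside the content div
            elif dict(attrs).get("class") == "content":
                self.div_depth = 1
        elif tag == "p" and self.div_depth > 0:
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


def extract_sentences(html: str) -> list:
    parser = ContentParagraphs()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up sample markup, not copied from the site.
sample = '<div class="content"><p>one</p><p>two</p></div><p>outside</p>'
sentences = extract_sentences(sample)
```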

Full code

The complete code for this example follows; the important parts are commented.

import random

import requests
from lxml import etree


class Spider16:
    def __init__(self):
        # First list page; the remaining list pages are appended by create_urls()
        self.wait_urls = ["http://www.xuejuzi.cn/gaoxiao/"]
        self.url_template = "http://www.xuejuzi.cn/gaoxiao/{num}.html"
        # Detail-page links collected from the list pages
        self.details = []

    def get_headers(self):
        # Pick a random search-engine crawler User-Agent for each request
        uas = [
            "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)",
            "Mozilla/5.0 (compatible; Baiduspider-render/2.0; +http://www.baidu.com/search/spider.html)",
            "Baiduspider-image+(+http://www.baidu.com/search/spider.htm)",
            "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 YisouSpider/5.0 Safari/537.36",
            "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "Mozilla/5.0 (compatible; Googlebot-Image/1.0; +http://www.google.com/bot.html)",
            "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Sogou News Spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
            "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
            "Sosospider+(+http://help.soso.com/webspider.htm)",
            "Mozilla/5.0 (compatible; Yahoo! Slurp China; http://misc.yahoo.com.cn/help.html)"
        ]
        ua = random.choice(uas)
        headers = {
            "user-agent": ua,
            "referer": "https://www.baidu.com"
        }
        return headers

    # Build the list of pages to crawl
    def create_urls(self):
        headers = self.get_headers()
        page_url = self.wait_urls[0]
        res = requests.get(url=page_url, headers=headers, timeout=5)
        html = etree.HTML(res.text)
        # Extract the total page count from the last pagination link
        last_page = html.xpath("//div[@class='page']/a[last()]/@href")
        if len(last_page) > 0:
            # Keep only the file name so both "N.html" and "/gaoxiao/N.html" parse
            total_pages = int(last_page[0].split("/")[-1].split(".")[0])
        else:
            # No pagination link found: crawl the first page only
            total_pages = 1

        # Page 1 is already in wait_urls, so start from page 2
        for i in range(2, total_pages + 1):
            self.wait_urls.append(self.url_template.format(num=i))

    def get_html(self):
        for url in self.wait_urls:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            if res:  # a Response is falsy for 4xx/5xx status codes
                html = etree.HTML(res.text)
                detail_link = html.xpath("//dl/dd[1]/a/@href")
                self.details.extend(detail_link)

    def get_detail(self):
        for url in self.details:
            headers = self.get_headers()
            res = requests.get(url, headers=headers, timeout=5)
            res.encoding = "gb2312"
            if res:
                html = etree.HTML(res.text)
                sentences = html.xpath("//div[@class='content']/p/text()")
                # Join the sentences, one per line
                long_str = "\n".join(sentences)

                with open("sentences.txt", "a+", encoding="utf-8") as f:
                    # The trailing newline keeps consecutive pages from running together
                    f.write(long_str + "\n")

    def run(self):
        self.create_urls()
        self.get_html()
        self.get_detail()


if __name__ == '__main__':
    s = Spider16()
    s.run()
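
The save step in get_detail joins the extracted text nodes with newlines and appends them to sentences.txt. Text nodes pulled from HTML often carry stray whitespace or are empty, so a small cleanup pass before writing may help. The sketch below is an optional addition, not part of the original code; the names `clean_sentences` and `to_block` are hypothetical.

```python
def clean_sentences(raw):
    """Strip surrounding whitespace and drop empty text nodes."""
    return [s.strip() for s in raw if s and s.strip()]


def to_block(sentences):
    """Join sentences one per line.

    The block ends with a newline so that blocks appended to the same
    file do not run into each other. Returns "" when nothing is left.
    """
    cleaned = clean_sentences(sentences)
    return "\n".join(cleaned) + "\n" if cleaned else ""


# Made-up input resembling raw XPath text() results.
block = to_block(["  first sentence ", "", "\r\n", "second sentence"])
```

The returned block could then be passed to f.write in place of the raw joined string.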

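Some of the sentences scraped in the end are genuinely funny:

1. Time really is precious: one second later and someone else would have grabbed the toilet.
2. I want to give my future mother-in-law a bad review: the delivery is too slow.
3. Falling in love with you hurt me to death.
4. I've quit smoking; any more and I really would be riding the clouds!
5. I've realized that all these years I've been a pair of underpants: whatever fart comes along, I have to take it.
6. Happy birthday to me! May my future wife find me, so we can hurry up, register our marriage, and have kids.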

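Bookmark time

Code download: https://codechina.csdn.net/hihell/python120. A Star would be appreciated.

Download of the material collected in this example

You're already here, so why not leave a comment, a like, and a bookmark?

Today is day 196 / 200 of my continuous writing streak. Feel free to follow, like, comment, and bookmark.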

Original link: 【http://xie.infoq.cn/article/5562efe65b3921f13ce3fd2d6】. Please contact the author before reposting.
