10994 部漫画信息，用 Python 实施大采集，因为反爬差一点就翻车了

梦想橡皮擦

关注

发布于: 7 小时前

橡皮擦的周末时间，浏览互联网，畅游知识的海洋，寻找好看的动漫，然后就发现了本文的主角，一个来自台湾省的网站。

Kindle 漫畫|Kobo 漫畫|epub 漫畫大采集

数据源分析

爬取目标

本次要爬取的网站是 https://vol.moe/，该网站打开的第一眼，就给我呈现了一个大数，收录 10994 部漫画，必须拿下。

为了降低博客的篇幅，还有大家练习的难度，本文只针对列表页抓取，里面涉及的目标数据结构如下：

漫画标题
漫画评分
漫画详情页链接
作者

同步保存漫画封面，文件名为漫画标题。

漫画详情页还存在大量的标签，因不涉及数据分析，故不再进行提取。

使用的 Python 模块

爬取模块 requests，数据提取模块 re，多线程模块 threading

重点学习内容

爬虫基本套路
数据提取，重点在行内 CSS 提取
CSV 格式文件存储
IP 限制反爬，没想到目标网站有反爬。

列表页分析

通过点击测试，得到的页码变化逻辑如下：

1. https://vol.moe/l/all,all,all,sortpoint,all,all,BL/1.htm2. https://vol.moe/l/all,all,all,sortpoint,all,all,BL/2.htm3. https://vol.moe/l/all,all,all,sortpoint,all,all,BL/524.htm

复制代码

页面数据为服务器直接静态返回，即加载到 HTML 页面中。

目标数据所在的 HTML DOM 结构如下所示。

在正式编写代码前，可以先针对性的处理一下正则表达式部分。

匹配封面图

编写该部分正则表达式时，出现如下图所示问题，折行+括号。

基于正则表达式编写经验，正则表达式如下，重点学习 \s 可匹配换行符。

<div class="img_book"[.\s]*style="background:url\((.*?)\)

复制代码

匹配文章标题、作者、详情页地址

由于该部分内容所在的 DOM 行格式比较特殊，可以直接匹配出结果，同时下述正则表达式使用了分组命名。

<a href='(?P<url>.*?)'>(?P<title>.*?)</a> <br /> (?P<author>.*?) <br />

复制代码

对于动漫评分部分的匹配就非常简单了，直接编写如下正则表达式即可。

<p style=".*?"><b>(.*?)</b></p>

复制代码

整理需求如下

开启 5 个线程对数据进行爬取；
保存所有的网页源码；
依次读取源码，提取元素到 CSV 文件中。

本案例将优先保存网页源码到本地，然后在对本地 HTML 文件进行处理。

编码时间

在编码测试的过程中，发现该网站存在反爬措施，即爬取过程中，如果网站发现是爬虫，直接就把 IP 给封杀掉了，程序还没有跑起来，就结束了。

再次测试，发现反爬技术使用重定向操作，即服务器发现是爬虫之后，直接返回状态码 302，并重定向谷歌网站，这波操作可以。

受限制的时间，经测试大概是 24 个小时，因此如果希望爬取到全部数据，需要通过不断切换 IP 和 UA ，将 HTML 静态文件保存到本地中。

下述代码通过判断目标网站返回状态码是否为 200，如果为 302，更换代理 IP，再次爬取。

import requestsimport reimport threadingimport timeimport random
# 以下为UA列表，为节约博客篇幅，列表只保留两项，完整版请去CSDN代码频道下载USER_AGENTS = [    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",]

# 循环获取 URLdef get_image(base_url, index):    headers = {        "User-Agent": random.choice(USER_AGENTS)
    }
    print(f"正在抓取{index}")    try:        res = requests.get(url=base_url, headers=headers, allow_redirects=False, timeout=10)        print(res.status_code)        # 当服务器返回 302状态码之后        while res.status_code == 302:          # 可直接访问 http://118.24.52.95:5010/get/ 获得一个代理IP            ip_json = requests.get("http://118.24.52.95:5010/get/", headers=headers).json()            ip = ip_json["proxy"]            proxies = {                "http": ip,                "https": ip            }            print(proxies)            # 使用代理IP继续爬取            res = requests.get(url=base_url, headers=headers, proxies=proxies, allow_redirects=False, timeout=10)            time.sleep(5)            print(res.status_code)
        else:            html = res.text            # 读取成功，保存为 html 文件            with open(f"html/{index}.html", "w+", encoding="utf-8") as f:                f.write(html)
        semaphore.release()    except Exception as e:        print(e)        print("睡眠 10s，再去抓取")        time.sleep(10)        get_image(base_url, index)

if __name__ == '__main__':    num = 0    # 最多开启5个线程    semaphore = threading.BoundedSemaphore(5)    lst_record_threads = []    for index in range(1, 525):        semaphore.acquire()        t = threading.Thread(target=get_image, args=(            f"https://vol.moe/l/all,all,all,sortpoint,all,all,BL/{index}.htm", index))        t.start()        lst_record_threads.append(t)
    for rt in lst_record_threads:        rt.join()

复制代码

只开启了 5 个线程，爬取过程如下所示。

经过 20 多分钟的等待，524 页数据，全部以静态页形式保存在了本地 HTML 文件夹中。

当文件全部存储到本地之后，在进行数据提取就非常简单了

编写如下提取代码，使用 os 模块。

import osimport reimport requests

def reade_html():    path = r"E:\pythonProject\test\html"    # 读取文件    files = os.listdir(path)
    for file in files:      # 拼接完整路径        file_path = os.path.join(path, file)        with open(file_path, "r", encoding="utf-8") as f:            html = f.read()            # 正则提取图片            img_pattern = re.compile('<div class="img_book"[.\s]*style="background:url\((.*?)\)')            # 正则提取标题            title_pattern = re.compile("<a href='(?P<url>.*?)'>(?P<title>.*?)</a> <br /> \[(?P<author>.*?)\] <br />")            # 正则提取得到            score_pattern = re.compile('<p style=".*?"><b>(.*?)</b></p>')            img_urls = img_pattern.findall(html)            details = title_pattern.findall(html)            scores = score_pattern.findall(html)
            # save(details, scores)            for index, url in enumerate(img_urls):                save_img(details[index][1], url)
# 数据保存成 csv 文件def save(details, scores):    for index, detail in enumerate(details):        my_str = "%s,%s,%s,%s\n" % (detail[1].replace(",", "，"), detail[0], detail[2].replace(",", "，"), scores[index])        with open("./comic.csv", "a+", encoding="utf-8") as f:            f.write(my_str)
# 图片按照动漫标题命名def save_img(title, url):    print(f"正在抓取{title}--{url}")    headers = {        "User-Agent": "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52"    }    try:        res = requests.get(url, headers=headers, allow_redirects=False, timeout=10)
        data = res.content        with open(f"imgs/{title}.jpg", "wb+") as f:            f.write(data)
    except Exception as e:        print(e)

if __name__ == '__main__':    reade_html()