Scraping Xici proxies with Scrapy and storing them in MySQL & MongoDB (hands-on tutorial with detailed steps)
Scraping Xici proxies with Scrapy
1. Create the project
scrapy startproject XcSpider
2. Create the spider
scrapy genspider xcdl xicidaili.com
First mark the project folder as Sources Root (in PyCharm) so that imports of your own modules don't fail.
3. Create a launcher file main.py
from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())
4. Overall project tree
Command for viewing the tree structure on Windows:
tree /F
(the /F flag also lists the files)
│ main.py
│ scrapy.cfg
│ xcdl.log
│
└───XcSpider
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├───mysqlpipelines
│ │ pipelines.py
│ │ sql.py
│ │ __init__.py
│ │
│ └───__pycache__
│ pipelines.cpython-36.pyc
│ sql.cpython-36.pyc
│ __init__.cpython-36.pyc
│
├───spiders
│ │ xcdl.py
│ │ __init__.py
│ │
│ └───__pycache__
│ xcdl.cpython-36.pyc
│ __init__.cpython-36.pyc
│
└───__pycache__
items.cpython-36.pyc
pipelines.cpython-36.pyc
settings.cpython-36.pyc
__init__.cpython-36.pyc
5. settings.py configuration
Add the MySQL and MongoDB connection settings.
Set ITEM_PIPELINES (the pipeline entries are added last); this is explained later.
Set DEFAULT_REQUEST_HEADERS: to reduce the chance of being blocked by anti-scraping measures, we add request headers.
# -*- coding: utf-8 -*-
# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XcSpider'
SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
' Chrome/80.0.3987.149 Safari/537.36',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Logging
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True
# MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'
# MongoDB settings
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'XCDL'
# collection that stores the scraped data
MONGODB_SHEETNAME = 'xicidaili'
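The pipelines later in this post read these values by importing XcSpider.settings directly. For reference, a hedged sketch of the Scrapy-native alternative, which receives settings through from_crawler (the class name SettingsAwarePipeline is made up for illustration):
# Hedged sketch (not part of the original project): a pipeline that pulls the
# custom MongoDB keys from crawler.settings instead of importing the module.
class SettingsAwarePipeline(object):
    def __init__(self, mongo_host, mongo_port):
        self.mongo_host = mongo_host
        self.mongo_port = mongo_port

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(
            mongo_host=crawler.settings.get('MONGODB_HOST'),
            mongo_port=crawler.settings.getint('MONGODB_PORT'),
        )

    def process_item(self, item, spider):
        return item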
6. items.py
Define the fields for the data you want to scrape.
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class XcspiderItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
class XiciDailiItem(scrapy.Item):
country = scrapy.Field()
ipaddress = scrapy.Field()
port = scrapy.Field()
serveraddr = scrapy.Field()
isanonymous = scrapy.Field()
type = scrapy.Field()
alivetime = scrapy.Field()
verificationtime = scrapy.Field()
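A quick hypothetical check (e.g. in a scrapy shell session, not part of the project) of how the item behaves: it works like a dict restricted to the fields declared above.
from XcSpider.items import XiciDailiItem

item = XiciDailiItem()
item['ipaddress'] = '127.0.0.1'   # OK: declared field
item['port'] = '8080'
print(dict(item))                 # {'ipaddress': '127.0.0.1', 'port': '8080'}
# item['foo'] = 'bar'             # would raise KeyError: field not declared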
7. xcdl.py
Process the page and extract the data we need.
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem
class XcdlSpider(scrapy.Spider):
name = 'xcdl'
allowed_domains = ['xicidaili.com']
start_urls = ['https://www.xicidaili.com/']
def parse(self, response):
# print(response.body.decode('utf-8'))
items_1 = response.xpath('//tr[@class="odd"]')
items_2 = response.xpath('//tr[@class=""]')
items = items_1 + items_2
for item in items:
# create a fresh item for every row so previously yielded items are not overwritten
infos = XiciDailiItem()
# country flag image URL
countries = item.xpath('./td[@class="country"]/img/@src').extract()
try:
country = countries[0]
except:
country = 'None'
# ip address
ipaddress = item.xpath('./td[2]/text()').extract()
try:
ipaddress = ipaddress[0]
except:
ipaddress = 'None'
# port
port = item.xpath('./td[3]/text()').extract()
try:
port = port[0]
except:
port = 'None'
# server location
serveraddr = item.xpath('./td[4]/text()').extract()
try:
serveraddr = serveraddr[0]
except:
serveraddr = 'None'
# anonymity level
isanonymous = item.xpath('./td[5]/text()').extract()
try:
isanonymous = isanonymous[0]
except:
isanonymous = 'None'
# proxy type
type = item.xpath('./td[6]/text()').extract()
try:
type = type[0]
except:
type = 'None'
# alive time
alivetime = item.xpath('./td[7]/text()').extract()
try:
alivetime = alivetime[0]
except:
alivetime = 'None'
# verification time
verificationtime = item.xpath('./td[8]/text()').extract()
try:
verificationtime = verificationtime[0]
except:
verificationtime = 'None'
print(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
infos['country'] = country
infos['ipaddress'] = ipaddress
infos['port'] = port
infos['serveraddr'] = serveraddr
infos['isanonymous'] = isanonymous
infos['type'] = type
infos['alivetime'] = alivetime
infos['verificationtime'] = verificationtime
yield infos
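All of the try/except blocks above do the same thing: take the first extracted value or fall back to 'None'. As a hedged alternative sketch (not the original code), recent Scrapy versions let you write the same parse method more compactly with SelectorList.get(default=...):
# Hedged sketch: a compact version of the parse method above; .get(default=...)
# returns the first match or the given default, replacing the try/except blocks.
def parse(self, response):
    rows = response.xpath('//tr[@class="odd"]') + response.xpath('//tr[@class=""]')
    for row in rows:
        infos = XiciDailiItem()
        infos['country'] = row.xpath('./td[@class="country"]/img/@src').get(default='None')
        infos['ipaddress'] = row.xpath('./td[2]/text()').get(default='None')
        infos['port'] = row.xpath('./td[3]/text()').get(default='None')
        infos['serveraddr'] = row.xpath('./td[4]/text()').get(default='None')
        infos['isanonymous'] = row.xpath('./td[5]/text()').get(default='None')
        infos['type'] = row.xpath('./td[6]/text()').get(default='None')
        infos['alivetime'] = row.xpath('./td[7]/text()').get(default='None')
        infos['verificationtime'] = row.xpath('./td[8]/text()').get(default='None')
        yield infos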
8. pipelines.py
i. Storing the data in MongoDB
The data has been extracted, so now we can store it in the databases. Let's write the MongoDB pipeline first; it can go directly into the project's pipelines.py file.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from XcSpider import settings
class XcspiderPipeline(object):
def process_item(self, item, spider):
return item
class XcPipeline(object):
def __init__(self):
host = settings.MONGODB_HOST
port = settings.MONGODB_PORT
dbname = settings.MONGODB_DBNAME
sheetname = settings.MONGODB_SHEETNAME
# create the MongoDB client connection
client = pymongo.MongoClient(host=host, port=port)
# select the database
mydb = client[dbname]
# collection that will store the data
self.post = mydb[sheetname]
def process_item(self, item, spider):
data = dict(item)
self.post.insert_one(data)  # insert_one(); the old insert() has been removed from recent pymongo versions
return item
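To check that documents actually reached MongoDB, a small standalone script (assuming the same host, port, database and collection as configured in settings.py) could look like this:
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
collection = client['XCDL']['xicidaili']
# how many proxy documents have been stored so far
print(collection.count_documents({}))
# peek at a few of them
for doc in collection.find().limit(3):
    print(doc.get('ipaddress'), doc.get('port'))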
ii. Storing the data in MySQL
For MySQL we can define our own custom pipelines.
Create a mysqlpipelines folder (or Package) under the project folder; its exact location is shown in the tree structure above.
First, write an SQL helper module --> sql.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> sql
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc
=================================================='''
import pymysql
from XcSpider import settings
MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB
db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()
class Sql(object):
@classmethod
def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s) '
value = {
'country': country,
'ipaddress': ipaddress,
'port': port,
'serveraddr': serveraddr,
'isanonymous': isanonymous,
'type': type,
'alivetime': alivetime,
'verificationtime': verificationtime,
}
try:
cursor.execute(sql, value)
db.commit()
except Exception as e:
print('insert failed ----', e)
db.rollback()
# deduplication check
@classmethod
def select_name(cls, ipaddress):
sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
value = {
'ipaddress': ipaddress
}
cursor.execute(sql, value)
return cursor.fetchall()[0]
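For reference, select_name returns the one-element tuple produced by the EXISTS query, which is what the pipeline below checks with ret[0] == 1. A quick hypothetical session (all values made up):
from XcSpider.mysqlpipelines.sql import Sql

# insert one made-up row, then query the duplicate flag
Sql.insert_db_xici('cn.png', '1.2.3.4', '8080', 'Beijing',
                   'High anonymity', 'HTTP', '1 day', '20-04-03 12:00')
print(Sql.select_name('1.2.3.4'))   # (1,) -> this ip is already in the table
print(Sql.select_name('9.9.9.9'))   # (0,) -> not stored yet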
Then the pipeline file mysqlpipelines\pipelines.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> pipelines
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc :
=================================================='''
from XcSpider.items import XiciDailiItem
from .sql import Sql
class XicidailiPipeline(object):
def process_item(self, item, spider):
if isinstance(item, XiciDailiItem):
ipaddress = item['ipaddress']
ret = Sql.select_name(ipaddress)
if ret[0] == 1:
print("ip: {} 已经存在啦----".format(ipaddress))
else:
country = item['country']
ipaddress = item['ipaddress']
port = item['port']
serveraddr = item['serveraddr']
isanonymous = item['isanonymous']
type = item['type']
alivetime = item['alivetime']
verificationtime = item['verificationtime']
Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
9. Pipeline settings in settings.py
These entries were already added to settings.py above; here they are again.
One is the MySQL pipeline and the other is the MongoDB pipeline.
The priority values can be set as you like.
You can enable both at once, or either one on its own.
A small tip: first locate the pipeline with an import statement (so the IDE resolves the dotted path for you), then copy that path into ITEM_PIPELINES, as shown below.
# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
10. Run the program
Now we can run main.py to start the crawler,
and the scraped data will show up in the databases.
The MongoDB pipeline works out of the box; for MySQL you first need to create the database and the table (named xicidaili). The CREATE TABLE statement is below.
CREATE TABLE `xicidaili` (
`id` int(255) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(1000) NOT NULL,
`ipaddress` varchar(1000) NOT NULL,
`port` int(255) NOT NULL,
`serveraddr` varchar(50) NOT NULL,
`isanonymous` varchar(30) NOT NULL,
`type` varchar(30) NOT NULL,
`alivetime` varchar(30) NOT NULL,
`verificationtime` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
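If you prefer to create the database and table from Python rather than a MySQL client, a hedged one-off setup sketch with pymysql (using the same credentials as in settings.py) could be:
import pymysql

# one-off setup: create db_xici and the xicidaili table before the first crawl
conn = pymysql.connect(user='root', password='root', host='127.0.0.1', port=3306, charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS db_xici DEFAULT CHARACTER SET utf8")
    cursor.execute("USE db_xici")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS xicidaili (
            `id` int(255) unsigned NOT NULL AUTO_INCREMENT,
            `country` varchar(1000) NOT NULL,
            `ipaddress` varchar(1000) NOT NULL,
            `port` int(255) NOT NULL,
            `serveraddr` varchar(50) NOT NULL,
            `isanonymous` varchar(30) NOT NULL,
            `type` varchar(30) NOT NULL,
            `alivetime` varchar(30) NOT NULL,
            `verificationtime` varchar(30) NOT NULL,
            PRIMARY KEY (`id`)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()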
end. Results
MySQL database (screenshot in the original post)
MongoDB database (screenshot in the original post)
Copyright notice: this is an original article by InfoQ author 若尘 (Ruochen).
Original link: http://xie.infoq.cn/article/c5a0a6213e33515c317582302. Please contact the author before republishing.