
Scrapy: crawling Xici proxies and storing them in MySQL & MongoDB (hands-on tutorial, detailed steps)

By 若尘

Crawling Xici proxies with Scrapy

1. Create the project

  • scrapy startproject XcSpider

2. Create the spider

  • scrapy genspider xcdl xicidaili.com


First mark the project folder as Sources Root (in PyCharm), so that importing our own modules does not fail later.
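For reference, the spider skeleton that scrapy genspider creates in step 2 (XcSpider/spiders/xcdl.py) looks roughly like this; the exact template output can differ slightly between Scrapy versions:

# -*- coding: utf-8 -*-
import scrapy


class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['http://xicidaili.com/']

    def parse(self, response):
        pass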

3. Create a launcher file main.py

from scrapy import cmdline

cmdline.execute('scrapy crawl xcdl'.split())
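If you prefer not to go through cmdline, a CrawlerProcess-based launcher also works. This is only a sketch of an alternative; the file name run_process.py is made up and not part of the original project:

# run_process.py -- hypothetical alternative launcher
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from XcSpider.spiders.xcdl import XcdlSpider

# load the project's settings.py and run the spider in-process
process = CrawlerProcess(get_project_settings())
process.crawl(XcdlSpider)
process.start()  # blocks until the crawl finishes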

4. Overall project tree structure

On Windows the tree structure can be printed with the command tree /F (the /F switch also lists the files).


│   main.py
│   scrapy.cfg
│   xcdl.log
└───XcSpider
    │   items.py
    │   middlewares.py
    │   pipelines.py
    │   settings.py
    │   __init__.py
    │
    ├───mysqlpipelines
    │   │   pipelines.py
    │   │   sql.py
    │   │   __init__.py
    │   │
    │   └───__pycache__
    │           pipelines.cpython-36.pyc
    │           sql.cpython-36.pyc
    │           __init__.cpython-36.pyc
    │
    ├───spiders
    │   │   xcdl.py
    │   │   __init__.py
    │   │
    │   └───__pycache__
    │           xcdl.cpython-36.pyc
    │           __init__.cpython-36.pyc
    │
    └───__pycache__
            items.cpython-36.pyc
            pipelines.cpython-36.pyc
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

5. settings.py configuration

  • Add the MySQL and MongoDB connection settings

  • Set ITEM_PIPELINES (the pipeline entries are added last, once the pipelines exist); this is explained later

  • Set DEFAULT_REQUEST_HEADERS with browser-like headers so the site's anti-crawling checks are less likely to block us


# -*- coding: utf-8 -*-
# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XcSpider'
SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
                  ' Chrome/80.0.3987.149 Safari/537.36',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Logging
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True

# MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'

# MongoDB settings
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'XCDL'
# collection that stores the scraped data
MONGODB_SHEETNAME = 'xicidaili'

6. items.py

  • Define the fields we want to scrape


# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class XcspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class XiciDailiItem(scrapy.Item):
    country = scrapy.Field()
    ipaddress = scrapy.Field()
    port = scrapy.Field()
    serveraddr = scrapy.Field()
    isanonymous = scrapy.Field()
    type = scrapy.Field()
    alivetime = scrapy.Field()
    verificationtime = scrapy.Field()
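A quick aside on why this is all we need for the storage code later: scrapy.Item instances behave like dicts, so the pipelines can simply call dict(item). A tiny illustration (the value below is made up, not scraped data):

from XcSpider.items import XiciDailiItem

info = XiciDailiItem()
info['ipaddress'] = '127.0.0.1'   # example value only
print(dict(info))                 # {'ipaddress': '127.0.0.1'}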

7. xcdl.py

  • Parse the page and extract the data we need


# -*- coding: utf-8 -*-
import scrapy

from XcSpider.items import XiciDailiItem


class XcdlSpider(scrapy.Spider):
    name = 'xcdl'
    allowed_domains = ['xicidaili.com']
    start_urls = ['https://www.xicidaili.com/']

    def parse(self, response):
        # print(response.body.decode('utf-8'))
        items_1 = response.xpath('//tr[@class="odd"]')
        items_2 = response.xpath('//tr[@class=""]')
        items = items_1 + items_2

        for item in items:
            # create a fresh item for every table row
            infos = XiciDailiItem()
            # country flag image link
            countries = item.xpath('./td[@class="country"]/img/@src').extract()
            try:
                country = countries[0]
            except IndexError:
                country = 'None'
            # ip address
            ipaddress = item.xpath('./td[2]/text()').extract()
            try:
                ipaddress = ipaddress[0]
            except IndexError:
                ipaddress = 'None'
            # port
            port = item.xpath('./td[3]/text()').extract()
            try:
                port = port[0]
            except IndexError:
                port = 'None'
            # server location
            serveraddr = item.xpath('./td[4]/text()').extract()
            try:
                serveraddr = serveraddr[0]
            except IndexError:
                serveraddr = 'None'
            # anonymity level
            isanonymous = item.xpath('./td[5]/text()').extract()
            try:
                isanonymous = isanonymous[0]
            except IndexError:
                isanonymous = 'None'
            # proxy type
            type = item.xpath('./td[6]/text()').extract()
            try:
                type = type[0]
            except IndexError:
                type = 'None'
            # alive time
            alivetime = item.xpath('./td[7]/text()').extract()
            try:
                alivetime = alivetime[0]
            except IndexError:
                alivetime = 'None'
            # verification time
            verificationtime = item.xpath('./td[8]/text()').extract()
            try:
                verificationtime = verificationtime[0]
            except IndexError:
                verificationtime = 'None'

            print(country, ipaddress, port, serveraddr, isanonymous, type,
                  alivetime, verificationtime)

            infos['country'] = country
            infos['ipaddress'] = ipaddress
            infos['port'] = port
            infos['serveraddr'] = serveraddr
            infos['isanonymous'] = isanonymous
            infos['type'] = type
            infos['alivetime'] = alivetime
            infos['verificationtime'] = verificationtime

            yield infos
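The parse() method above repeats the same "take the first match or fall back to the string 'None'" pattern eight times. As an optional refactor (a sketch, not the code this tutorial actually uses), a small helper keeps that logic in one place:

def first_or_none(selector, xpath):
    """Return the first text node matched by xpath, or the string 'None'."""
    results = selector.xpath(xpath).extract()
    return results[0] if results else 'None'

# inside parse(), each field then becomes a one-liner, e.g.:
#     ipaddress = first_or_none(item, './td[2]/text()')
#     port = first_or_none(item, './td[3]/text()')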

8. pipelines.py

i. Storing the data in MongoDB

  • The data has been extracted, so it can now be written to a database. We write the MongoDB pipeline first; it goes directly into the project's pipelines.py file


# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from XcSpider import settings


class XcspiderPipeline(object):
    def process_item(self, item, spider):
        return item


class XcPipeline(object):
    def __init__(self):
        host = settings.MONGODB_HOST
        port = settings.MONGODB_PORT
        dbname = settings.MONGODB_DBNAME
        sheetname = settings.MONGODB_SHEETNAME
        # create the MongoDB connection
        client = pymongo.MongoClient(host=host, port=port)
        # select the database
        mydb = client[dbname]
        # collection that stores the scraped records
        self.post = mydb[sheetname]

    def process_item(self, item, spider):
        data = dict(item)
        self.post.insert_one(data)
        return item
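The pipeline above reads its configuration by importing XcSpider.settings directly, which is simple and works. An equivalent variant (a sketch, not this article's code; the class name is hypothetical) reads the same values through crawler.settings via from_crawler, which also respects per-spider custom_settings:

import pymongo


class MongoPipelineVariant(object):   # hypothetical name, for illustration only
    @classmethod
    def from_crawler(cls, crawler):
        # pull the MONGODB_* values defined in settings.py
        return cls(
            host=crawler.settings.get('MONGODB_HOST'),
            port=crawler.settings.getint('MONGODB_PORT'),
            dbname=crawler.settings.get('MONGODB_DBNAME'),
            sheetname=crawler.settings.get('MONGODB_SHEETNAME'),
        )

    def __init__(self, host, port, dbname, sheetname):
        client = pymongo.MongoClient(host=host, port=port)
        self.post = client[dbname][sheetname]

    def process_item(self, item, spider):
        self.post.insert_one(dict(item))
        return item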

ii. Storing the data in MySQL

  • For MySQL we can define our own custom pipelines

  • Create a new mysqlpipelines folder (or Package) under the project folder; its exact location is shown in the tree structure above

  • First, write a small SQL helper module --> sql.py


# -*- coding: UTF-8 -*-
'''
=================================================
@Project -> File   : project -> sql
@IDE    : PyCharm
@Author : ruochen
@Date   : 2020/4/3 12:53
@Desc   :
==================================================
'''
import pymysql

from XcSpider import settings

MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB

db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST,
                     port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()


class Sql(object):

    @classmethod
    def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous,
                       type, alivetime, verificationtime):
        sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
              ' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s)'
        value = {
            'country': country,
            'ipaddress': ipaddress,
            'port': port,
            'serveraddr': serveraddr,
            'isanonymous': isanonymous,
            'type': type,
            'alivetime': alivetime,
            'verificationtime': verificationtime,
        }
        try:
            cursor.execute(sql, value)
            db.commit()
        except Exception as e:
            print('insert failed ----', e)
            db.rollback()

    # deduplication check
    @classmethod
    def select_name(cls, ipaddress):
        sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
        value = {
            'ipaddress': ipaddress
        }
        cursor.execute(sql, value)
        return cursor.fetchall()[0]
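A note on select_name: with the default pymysql cursor, fetchall() returns a tuple of rows, so fetchall()[0] is a 1-tuple such as (0,) or (1,), which is why the pipeline in the next file checks ret[0] == 1. Called directly, the helpers behave roughly like this (a sketch; all values are made-up examples):

from XcSpider.mysqlpipelines.sql import Sql

ret = Sql.select_name('127.0.0.1')    # (1,) if the ip is already stored, otherwise (0,)
if ret[0] == 0:
    # the arguments below are example values, not scraped data
    Sql.insert_db_xici('None', '127.0.0.1', '8080', 'local test', 'high anonymous',
                       'HTTP', '1 day', '20-04-03 13:00')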


  • Then the pipeline file mysqlpipelines\pipelines.py


# -*- coding: UTF-8 -*-
'''
=================================================
@Project -> File   : project -> pipelines
@IDE    : PyCharm
@Author : ruochen
@Date   : 2020/4/3 12:53
@Desc   :
==================================================
'''
from XcSpider.items import XiciDailiItem
from .sql import Sql


class XicidailiPipeline(object):
    def process_item(self, item, spider):
        if isinstance(item, XiciDailiItem):
            ipaddress = item['ipaddress']
            ret = Sql.select_name(ipaddress)
            if ret[0] == 1:
                print("ip: {} already exists ----".format(ipaddress))
            else:
                country = item['country']
                ipaddress = item['ipaddress']
                port = item['port']
                serveraddr = item['serveraddr']
                isanonymous = item['isanonymous']
                type = item['type']
                alivetime = item['alivetime']
                verificationtime = item['verificationtime']

                Sql.insert_db_xici(country, ipaddress, port, serveraddr,
                                   isanonymous, type, alivetime, verificationtime)
        # always hand the item on to any later pipelines
        return item

9. Pipeline settings in settings.py

  • These entries were already added to settings.py earlier; we go over them once more here

  • One entry is the MySQL pipeline, the other is the MongoDB pipeline

  • The priority values can be chosen freely; pipelines with a lower number run first

  • You can enable both at the same time, or just one of them


Here is a small tip: you can first locate the pipeline class with an import statement (letting the IDE complete the dotted path), then copy that path into ITEM_PIPELINES, as shown below.


# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
    # 'XcSpider.pipelines.XcspiderPipeline': 300,
    'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
    'XcSpider.pipelines.XcPipeline': 200,
}

10. Run the program

  • Now we can run the main.py file to start the crawler

  • Afterwards the scraped data can be found in the databases




Note that the MySQL side needs a db_xici database with a xicidaili table before the first run (MongoDB creates its database and collection automatically on the first insert). The table used here is:

CREATE TABLE `xicidaili` (
  `id` int(255) unsigned NOT NULL AUTO_INCREMENT,
  `country` varchar(1000) NOT NULL,
  `ipaddress` varchar(1000) NOT NULL,
  `port` int(255) NOT NULL,
  `serveraddr` varchar(50) NOT NULL,
  `isanonymous` varchar(30) NOT NULL,
  `type` varchar(30) NOT NULL,
  `alivetime` varchar(30) NOT NULL,
  `verificationtime` varchar(30) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=64 DEFAULT CHARSET=utf8;
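The db_xici database itself also has to exist before sql.py can connect to it. It can be created in any MySQL client; a minimal sketch using pymysql with the same credentials as settings.py:

import pymysql

# connect without selecting a database, then create db_xici if it is missing
conn = pymysql.connect(user='root', password='root', host='127.0.0.1',
                       port=3306, charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS db_xici DEFAULT CHARACTER SET utf8")
conn.close()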

end. Run results

MySQL database

MongoDB database

