Scraping Xici proxies with Scrapy and storing them in MySQL & MongoDB (hands-on tutorial with detailed steps)
Scraping Xici proxies with Scrapy
1. Create the project
scrapy startproject XcSpider
2. Create the spider
scrapy genspider xcdl xicidaili.com
First mark the project folder as Sources Root (in PyCharm) so that imports of your own modules don't fail.
3. Create a launcher file main.py
from scrapy import cmdline
cmdline.execute('scrapy crawl xcdl'.split())
4. Overall project tree
Command for viewing the tree structure on Windows:
tree /F
(the /F flag also lists the files)
│ main.py
│ scrapy.cfg
│ xcdl.log
│
└───XcSpider
│ items.py
│ middlewares.py
│ pipelines.py
│ settings.py
│ __init__.py
│
├───mysqlpipelines
│ │ pipelines.py
│ │ sql.py
│ │ __init__.py
│ │
│ └───__pycache__
│ pipelines.cpython-36.pyc
│ sql.cpython-36.pyc
│ __init__.cpython-36.pyc
│
├───spiders
│ │ xcdl.py
│ │ __init__.py
│ │
│ └───__pycache__
│ xcdl.cpython-36.pyc
│ __init__.cpython-36.pyc
│
└───__pycache__
items.cpython-36.pyc
pipelines.cpython-36.pyc
settings.cpython-36.pyc
__init__.cpython-36.pyc
5. settings.py configuration
Add the MySQL and MongoDB connection settings.
Set ITEM_PIPELINES (the pipeline entries are added last); this is explained later.
Set DEFAULT_REQUEST_HEADERS: to reduce the chance of being blocked by anti-scraping measures, we add request headers.
# -*- coding: utf-8 -*-
# Scrapy settings for XcSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'XcSpider'
SPIDER_MODULES = ['XcSpider.spiders']
NEWSPIDER_MODULE = 'XcSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'XcSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.9',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko)'
' Chrome/80.0.3987.149 Safari/537.36',
}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'XcSpider.middlewares.XcspiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# Logging
LOG_FILE = 'xcdl.log'
LOG_LEVEL = 'ERROR'
LOG_ENABLED = True
# MySQL settings
MYSQL_HOST = '127.0.0.1'
MYSQL_USER = 'root'
MYSQL_PASSWORD = 'root'
MYSQL_PORT = 3306
MYSQL_DB = 'db_xici'
# MongoDB settings
# MongoDB host
MONGODB_HOST = '127.0.0.1'
# MongoDB port
MONGODB_PORT = 27017
# database name
MONGODB_DBNAME = 'XCDL'
# collection that stores the scraped data
MONGODB_SHEETNAME = 'xicidaili'
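The pipelines later in this post read these values by importing XcSpider.settings directly. For reference, a hedged sketch of the Scrapy-native alternative, which receives settings through from_crawler (the class name SettingsAwarePipeline is made up for illustration):
# Hedged sketch (not part of the original project): a pipeline that pulls the
# custom MongoDB keys from crawler.settings instead of importing the module.
class SettingsAwarePipeline(object):
    def __init__(self, mongo_host, mongo_port):
        self.mongo_host = mongo_host
        self.mongo_port = mongo_port

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(
            mongo_host=crawler.settings.get('MONGODB_HOST'),
            mongo_port=crawler.settings.getint('MONGODB_PORT'),
        )

    def process_item(self, item, spider):
        return item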
6. items.py
Define the fields for the data you want to scrape.
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class XcspiderItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
pass
class XiciDailiItem(scrapy.Item):
country = scrapy.Field()
ipaddress = scrapy.Field()
port = scrapy.Field()
serveraddr = scrapy.Field()
isanonymous = scrapy.Field()
type = scrapy.Field()
alivetime = scrapy.Field()
verificationtime = scrapy.Field()
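A quick hypothetical check (e.g. in a scrapy shell session, not part of the project) of how the item behaves: it works like a dict restricted to the fields declared above.
from XcSpider.items import XiciDailiItem

item = XiciDailiItem()
item['ipaddress'] = '127.0.0.1'   # OK: declared field
item['port'] = '8080'
print(dict(item))                 # {'ipaddress': '127.0.0.1', 'port': '8080'}
# item['foo'] = 'bar'             # would raise KeyError: field not declared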
7. xcdl.py
Process the page and extract the data we need.
# -*- coding: utf-8 -*-
import scrapy
from XcSpider.items import XiciDailiItem
class XcdlSpider(scrapy.Spider):
name = 'xcdl'
allowed_domains = ['xicidaili.com']
start_urls = ['https://www.xicidaili.com/']
def parse(self, response):
# print(response.body.decode('utf-8'))
items_1 = response.xpath('//tr[@class="odd"]')
items_2 = response.xpath('//tr[@class=""]')
items = items_1 + items_2
for item in items:
# create a fresh item for every row so previously yielded items are not overwritten
infos = XiciDailiItem()
# country flag image URL
countries = item.xpath('./td[@class="country"]/img/@src').extract()
try:
country = countries[0]
except:
country = 'None'
# ip address
ipaddress = item.xpath('./td[2]/text()').extract()
try:
ipaddress = ipaddress[0]
except:
ipaddress = 'None'
# port
port = item.xpath('./td[3]/text()').extract()
try:
port = port[0]
except:
port = 'None'
# server location
serveraddr = item.xpath('./td[4]/text()').extract()
try:
serveraddr = serveraddr[0]
except:
serveraddr = 'None'
# anonymity level
isanonymous = item.xpath('./td[5]/text()').extract()
try:
isanonymous = isanonymous[0]
except:
isanonymous = 'None'
# proxy type
type = item.xpath('./td[6]/text()').extract()
try:
type = type[0]
except:
type = 'None'
# alive time
alivetime = item.xpath('./td[7]/text()').extract()
try:
alivetime = alivetime[0]
except:
alivetime = 'None'
# verification time
verificationtime = item.xpath('./td[8]/text()').extract()
try:
verificationtime = verificationtime[0]
except:
verificationtime = 'None'
print(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
infos['country'] = country
infos['ipaddress'] = ipaddress
infos['port'] = port
infos['serveraddr'] = serveraddr
infos['isanonymous'] = isanonymous
infos['type'] = type
infos['alivetime'] = alivetime
infos['verificationtime'] = verificationtime
yield infos
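All of the try/except blocks above do the same thing: take the first extracted value or fall back to 'None'. As a hedged alternative sketch (not the original code), recent Scrapy versions let you write the same parse method more compactly with SelectorList.get(default=...):
# Hedged sketch: a compact version of the parse method above; .get(default=...)
# returns the first match or the given default, replacing the try/except blocks.
def parse(self, response):
    rows = response.xpath('//tr[@class="odd"]') + response.xpath('//tr[@class=""]')
    for row in rows:
        infos = XiciDailiItem()
        infos['country'] = row.xpath('./td[@class="country"]/img/@src').get(default='None')
        infos['ipaddress'] = row.xpath('./td[2]/text()').get(default='None')
        infos['port'] = row.xpath('./td[3]/text()').get(default='None')
        infos['serveraddr'] = row.xpath('./td[4]/text()').get(default='None')
        infos['isanonymous'] = row.xpath('./td[5]/text()').get(default='None')
        infos['type'] = row.xpath('./td[6]/text()').get(default='None')
        infos['alivetime'] = row.xpath('./td[7]/text()').get(default='None')
        infos['verificationtime'] = row.xpath('./td[8]/text()').get(default='None')
        yield infos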
8. pipelines.py
i. Storing the data in MongoDB
The data has been extracted, so now we can store it in the databases. Let's write the MongoDB pipeline first; it can go directly into the project's pipelines.py file.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from XcSpider import settings
class XcspiderPipeline(object):
def process_item(self, item, spider):
return item
class XcPipeline(object):
def __init__(self):
host = settings.MONGODB_HOST
port = settings.MONGODB_PORT
dbname = settings.MONGODB_DBNAME
sheetname = settings.MONGODB_SHEETNAME
# create the MongoDB client connection
client = pymongo.MongoClient(host=host, port=port)
# select the database
mydb = client[dbname]
# collection that will store the data
self.post = mydb[sheetname]
def process_item(self, item, spider):
data = dict(item)
self.post.insert_one(data)  # insert_one(); the old insert() has been removed from recent pymongo versions
return item
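To check that documents actually reached MongoDB, a small standalone script (assuming the same host, port, database and collection as configured in settings.py) could look like this:
import pymongo

client = pymongo.MongoClient(host='127.0.0.1', port=27017)
collection = client['XCDL']['xicidaili']
# how many proxy documents have been stored so far
print(collection.count_documents({}))
# peek at a few of them
for doc in collection.find().limit(3):
    print(doc.get('ipaddress'), doc.get('port'))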
ii. Storing the data in MySQL
For MySQL we can define our own custom pipelines.
Create a mysqlpipelines folder (or Package) under the project folder; its exact location is shown in the tree structure above.
First, write an SQL helper module --> sql.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> sql
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc
=================================================='''
import pymysql
from XcSpider import settings
MYSQL_HOST = settings.MYSQL_HOST
MYSQL_USER = settings.MYSQL_USER
MYSQL_PASSWORD = settings.MYSQL_PASSWORD
MYSQL_PORT = settings.MYSQL_PORT
MYSQL_DB = settings.MYSQL_DB
db = pymysql.connect(user=MYSQL_USER, password=MYSQL_PASSWORD, host=MYSQL_HOST, port=MYSQL_PORT, database=MYSQL_DB, charset="utf8")
cursor = db.cursor()
class Sql(object):
@classmethod
def insert_db_xici(cls, country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime):
sql = 'insert into xicidaili(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)' \
' values (%(country)s, %(ipaddress)s, %(port)s, %(serveraddr)s, %(isanonymous)s, %(type)s, %(alivetime)s, %(verificationtime)s) '
value = {
'country': country,
'ipaddress': ipaddress,
'port': port,
'serveraddr': serveraddr,
'isanonymous': isanonymous,
'type': type,
'alivetime': alivetime,
'verificationtime': verificationtime,
}
try:
cursor.execute(sql, value)
db.commit()
except Exception as e:
print('insert failed ----', e)
db.rollback()
# deduplication check
@classmethod
def select_name(cls, ipaddress):
sql = "select exists(select 1 from xicidaili where ipaddress=%(ipaddress)s)"
value = {
'ipaddress': ipaddress
}
cursor.execute(sql, value)
return cursor.fetchall()[0]
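For reference, select_name returns the one-element tuple produced by the EXISTS query, which is what the pipeline below checks with ret[0] == 1. A quick hypothetical session (all values made up):
from XcSpider.mysqlpipelines.sql import Sql

# insert one made-up row, then query the duplicate flag
Sql.insert_db_xici('cn.png', '1.2.3.4', '8080', 'Beijing',
                   'High anonymity', 'HTTP', '1 day', '20-04-03 12:00')
print(Sql.select_name('1.2.3.4'))   # (1,) -> this ip is already in the table
print(Sql.select_name('9.9.9.9'))   # (0,) -> not stored yet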
Then the pipeline file mysqlpipelines\pipelines.py
# -*- coding: UTF-8 -*-
'''=================================================
@Project -> File :project -> pipelines
@IDE :PyCharm
@Author :ruochen
@Date :2020/4/3 12:53
@Desc :
=================================================='''
from XcSpider.items import XiciDailiItem
from .sql import Sql
class XicidailiPipeline(object):
def process_item(self, item, spider):
if isinstance(item, XiciDailiItem):
ipaddress = item['ipaddress']
ret = Sql.select_name(ipaddress)
if ret[0] == 1:
print("ip: {} 已经存在啦----".format(ipaddress))
else:
country = item['country']
ipaddress = item['ipaddress']
port = item['port']
serveraddr = item['serveraddr']
isanonymous = item['isanonymous']
type = item['type']
alivetime = item['alivetime']
verificationtime = item['verificationtime']
Sql.insert_db_xici(country, ipaddress, port, serveraddr, isanonymous, type, alivetime, verificationtime)
9. Pipeline settings in settings.py
These entries were already added to settings.py above; here they are again.
One is the MySQL pipeline and the other is the MongoDB pipeline.
The priority values can be set as you like.
You can enable both at once, or either one on its own.
A small tip: first locate the pipeline with an import statement (so the IDE resolves the dotted path for you), then copy that path into ITEM_PIPELINES, as shown below.
# from XcSpider.mysqlpipelines.pipelines import XicidailiPipeline
ITEM_PIPELINES = {
# 'XcSpider.pipelines.XcspiderPipeline': 300,
'XcSpider.mysqlpipelines.pipelines.XicidailiPipeline': 300,
'XcSpider.pipelines.XcPipeline': 200,
}
10. Run the program
Now we can run main.py to start the crawler,
and the scraped data will show up in the databases.
The MongoDB pipeline works out of the box; for MySQL you first need to create the database and the table (named xicidaili). The CREATE TABLE statement is below.
CREATE TABLE `xicidaili` (
`id` int(255) unsigned NOT NULL AUTO_INCREMENT,
`country` varchar(1000) NOT NULL,
`ipaddress` varchar(1000) NOT NULL,
`port` int(255) NOT NULL,
`serveraddr` varchar(50) NOT NULL,
`isanonymous` varchar(30) NOT NULL,
`type` varchar(30) NOT NULL,
`alivetime` varchar(30) NOT NULL,
`verificationtime` varchar(30) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
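If you prefer to create the database and table from Python rather than a MySQL client, a hedged one-off setup sketch with pymysql (using the same credentials as in settings.py) could be:
import pymysql

# one-off setup: create db_xici and the xicidaili table before the first crawl
conn = pymysql.connect(user='root', password='root', host='127.0.0.1', port=3306, charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS db_xici DEFAULT CHARACTER SET utf8")
    cursor.execute("USE db_xici")
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS xicidaili (
            `id` int(255) unsigned NOT NULL AUTO_INCREMENT,
            `country` varchar(1000) NOT NULL,
            `ipaddress` varchar(1000) NOT NULL,
            `port` int(255) NOT NULL,
            `serveraddr` varchar(50) NOT NULL,
            `isanonymous` varchar(30) NOT NULL,
            `type` varchar(30) NOT NULL,
            `alivetime` varchar(30) NOT NULL,
            `verificationtime` varchar(30) NOT NULL,
            PRIMARY KEY (`id`)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8
    """)
conn.commit()
conn.close()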
end. Results
MySQL database (screenshot in the original post)
MongoDB database (screenshot in the original post)
Copyright notice: this is an original article by InfoQ author 若尘 (Ruochen).
Original link: http://xie.infoq.cn/article/c5a0a6213e33515c317582302. Please contact the author before republishing.