How to extend the crawler with a spider class for a new rule URL:
1. Install the runtime environment
Python 3.8: https://www.python.org/downloads/
Scrapy 2.4.1: https://scrapy.org/
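Once Python is in place, Scrapy and its dependencies can be installed with pip (the version pin below just matches the version listed above; adjust as needed):

    pip install scrapy==2.4.1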
2. Create a spider subclass for the new site. It must inherit from scrapy.Spider and override allowed_domains, start_urls, and the parse method:
import scrapy

class YourSpiderClass(scrapy.Spider):
    name = 'get_et_rules'
    allowed_domains = ['rules.emergingthreats.net']
    start_urls = ['https://rules.emergingthreats.net/open/suricata-5.0/emerging-all.rules']

    def parse(self, response):
        # Called with the downloaded response for each start URL;
        # extract and yield your rule items here.
        # your parse code ... ...
        pass
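As an illustration only, here is one possible parse body for a plain-text rule feed. It is a minimal sketch, assuming one rule per line; the item key 'rule' is a hypothetical name, not something defined by the project:

    def parse(self, response):
        # Assumption: the downloaded file is plain text with one rule per line.
        for line in response.text.splitlines():
            line = line.strip()
            # Skip blank lines and '#' comment lines.
            if line and not line.startswith('#'):
                yield {'rule': line}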
3. Then register the new spider in RulesSpider by adding another crawl call to the crawler process:
from scrapy.crawler import CrawlerProcess

class RulesSpider:
    def __init__(self):
        # A single CrawlerProcess drives all registered spiders in one run.
        self.process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
        self.process.crawl(GetEtRulesSpider)
        self.process.crawl(GetSSLIpBlackListSpider)
        self.process.crawl(GetSSLIpBlackListAggressiveSpider)
        # your process class ... ...

    def start(self):
        # Blocks until every registered spider has finished.
        self.process.start()
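With the spiders registered, main.py only needs to create a RulesSpider and start it. A minimal sketch, assuming RulesSpider and the spider classes above are importable from your project modules:

    if __name__ == '__main__':
        rules_spider = RulesSpider()
        rules_spider.start()  # blocks until all registered spiders finish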
4. Run main.py; it prints output like the following:
2021-10-25 16:08:47 [scrapy.core.engine] INFO: Closing spider (finished)
2021-10-25 16:08:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 281,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 2830542,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 4.061754,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 10, 25, 8, 8, 47, 572493),
'log_count/DEBUG': 3,
'log_count/INFO': 34,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2021, 10, 25, 8, 8, 43, 510739)}
2021-10-25 16:08:47 [scrapy.core.engine] INFO: Spider closed (finished)
update nidps rule successfully, totally time costs 5s
old rules has 29765 items, new rules has 29778 items
incre 8011 items, del 7998 items