How to extend the crawler with a spider class for a new rules URL:

1. Install the runtime environment:
Python 3.8: https://www.python.org/downloads/
Scrapy 2.4.1: https://scrapy.org/

2. Create a spider subclass for the target site. It must inherit from scrapy.Spider and override allowed_domains, start_urls, and the parse function:

import scrapy

class YourSpiderClass(scrapy.Spider):
    name = 'get_et_rules'
    allowed_domains = ['rules.emergingthreats.net']
    start_urls = ['https://rules.emergingthreats.net/open/suricata-5.0/emerging-all.rules']

    def parse(self, response):
        # your parse code ...
        pass
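For reference, one possible way to fill in the parse callback for the emerging-all.rules example above is sketched below. This is a minimal illustration, not the project's actual parser: the assumption that each non-comment line of the response is a complete Suricata rule, and the shape of the yielded item, are both hypothetical.

import scrapy

class GetEtRulesSpider(scrapy.Spider):
    name = 'get_et_rules'
    allowed_domains = ['rules.emergingthreats.net']
    start_urls = ['https://rules.emergingthreats.net/open/suricata-5.0/emerging-all.rules']

    def parse(self, response):
        # Assumption: the ruleset is plain text, one Suricata rule per
        # line, with '#' marking comments and disabled rules.
        for line in response.text.splitlines():
            line = line.strip()
            if not line or line.startswith('#'):
                continue
            # Yield each rule as a simple item (illustrative field name).
            yield {'rule': line}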
3. Register the new spider with the crawler process in RulesSpider:

from scrapy.crawler import CrawlerProcess

class RulesSpider():
    def __init__(self):
        self.process = CrawlerProcess({
            'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
        })
        self.process.crawl(GetEtRulesSpider)
        self.process.crawl(GetSSLIpBlackListSpider)
        self.process.crawl(GetSSLIpBlackListAggressiveSpider)
        # crawl your new spider class here ...

    def start(self):
        self.process.start()

4. Run main.py; the output looks like this:

2021-10-25 16:08:47 [scrapy.core.engine] INFO: Closing spider (finished)
2021-10-25 16:08:47 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 281,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 2830542,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 4.061754,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 10, 25, 8, 8, 47, 572493),
 'log_count/DEBUG': 3,
 'log_count/INFO': 34,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 10, 25, 8, 8, 43, 510739)}
2021-10-25 16:08:47 [scrapy.core.engine] INFO: Spider closed (finished)
update nidps rule successfully, totally time costs 5s
old rules has 29765 items, new rules has 29778 items
incre 8011 items, del 7998 items
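A minimal main.py driver might look like the sketch below. The module path rules_spider and the diff_rules helper are assumptions for illustration; CrawlerProcess.start() blocks until every registered spider has finished, which is why the summary lines print only after the Scrapy stats.

# main.py - minimal driver sketch; module path 'rules_spider' is assumed
import time
from rules_spider import RulesSpider

def diff_rules(old_rules, new_rules):
    # Hypothetical helper: report added/removed rules via set difference,
    # one way to produce counts like the 'incre ... items, del ... items'
    # line in the output above.
    incre = set(new_rules) - set(old_rules)
    deleted = set(old_rules) - set(new_rules)
    return incre, deleted

if __name__ == '__main__':
    begin = time.time()
    spider = RulesSpider()
    spider.start()  # blocks until all registered spiders finish
    print('update nidps rule successfully, totally time costs %ds'
          % (time.time() - begin))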