Saving the fruit images on the USDA website with a Python crawler

Published: April 25, 2020
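Introduction
The US Department of Agriculture has produced some 7,500 watercolor "ID photos" covering all known fruit in the world and offers them as high-resolution downloads; the link is here.
"Strawberry"

The goal of this crawler is to fetch those ID photos and save them to local disk.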
﻿
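Analysis
The site lists 7,584 images across 380 pages, 20 per page.

Page 1: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=0
Page 2: https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start=20
...
And so on: the start parameter grows by 20 with each page, so the page URLs are easy to generate.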
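The paging rule can be sketched in a few lines. This is only an illustration: the names BASE_URL, PAGE_SIZE, PAGE_COUNT and page_url are made up for this sketch, while the URL pattern and the numbers are the ones quoted above.

```python
# Hypothetical helper names; the URL pattern and the figures come from the analysis above.
BASE_URL = 'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}'
PAGE_SIZE = 20    # records per result page
PAGE_COUNT = 380  # number of result pages


def page_url(page_index):
    """Return the URL of a result page; page_index is 0-based."""
    return BASE_URL.format(page_index * PAGE_SIZE)


page_urls = [page_url(i) for i in range(PAGE_COUNT)]
print(page_urls[0])   # first page
print(page_urls[1])   # second page
print(page_urls[-1])  # last page
```

The last page starts at offset 7580, so it holds only the final 4 of the 7,584 images.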
﻿
Each record is one HTML element on the result page, and from it we can read:
artist
year
scientific name
common name
the URL of the thumbnail
﻿
Clicking an image opens its detail page, and clicking the "click to enlarge" button there gives us the original image.
﻿
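That would mean opening a separate page for every single image, though. It turns out that the thumbnail URL and the original image URL are related:
Thumbnail URL: ../download/POM00007435/thumbnail
Original URL: https://usdawatercolors.nal.usda.gov/download/POM00007435/screen

So once we extract POM00007435 from the thumbnail URL, we can assemble the matching original image URL.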
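As a small standalone sketch of that mapping: the function name original_url is made up here, and instead of relying on a fixed position in the path it looks for the segment that follows "download", which is a slight variation on simply indexing the split result.

```python
def original_url(thumb_src):
    """Build the full-size image URL from a thumbnail src such as '../download/<id>/thumbnail'."""
    parts = thumb_src.split('/')
    # The image id is the path segment right after 'download'.
    img_id = parts[parts.index('download') + 1]
    return 'https://usdawatercolors.nal.usda.gov/download/{}/screen'.format(img_id)


print(original_url('../download/POM00007435/thumbnail'))
```

Looking up the "download" segment means the helper also works if the src ever turns out to be an absolute URL, at the cost of raising ValueError when that segment is missing.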
﻿
Crawler
Dependencies
requests
beautifulsoup4
﻿
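Source code
Loop 380 times, once per page
On each page, collect the HTML tags of its 20 records
For each HTML tag:
    read artist, year and the other fields
    build the original image URL from the thumbnail URL
    download the original image and save it locally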
﻿
import os

import requests
from bs4 import BeautifulSoup

IMG_FOLDER = 'fruitimages/'


def run():
    # Create the output folder up front; open() in download_and_save would fail without it.
    os.makedirs(IMG_FOLDER, exist_ok=True)
    # 380 result pages; the start parameter is the offset of the page's first record.
    for idx in range(380):
        resp = requests.get(
            'https://usdawatercolors.nal.usda.gov/pom/search.xhtml?start={}&searchText=&searchField=&sortField='.format(
                idx * 20))
        soup = BeautifulSoup(resp.text, 'html.parser')
        for (div_idx, div) in enumerate(soup.select('div.document')):
            doc = div.select_one('dl.defList')
            artist = doc.select_one(':nth-child(2)>a').get_text()
            year = doc.select_one(':nth-child(4)>a').get_text()
            # Some pictures have no scientific or common name; use 'none' instead to avoid terminating.
            scientific_tag = doc.select_one(':nth-child(6)>a')
            scientific_name = 'none' if scientific_tag is None else scientific_tag.get_text()
            common_tag = doc.select_one(':nth-child(8)>a')
            common_name = 'none' if common_tag is None else common_tag.get_text()
            thumb_pic_src = div.select_one('div.thumb-frame>a>img')['src']
            # Number used as the file name prefix (the first image on the first page gets 21).
            id = (idx + 1) * 20 + div_idx + 1
            info = FruitInfo(id, artist, year, scientific_name, common_name, thumb_pic_src)
            print(info)
            info.download_and_save()


class FruitInfo:
    def __init__(self, id, artist, year, scientific_name, common_name, thumb_pic_url):
        self.id = id
        self.artist = artist
        self.year = year
        self.scientific_name = scientific_name
        self.common_name = common_name
        self.thumb_pic_url = thumb_pic_url

    def download_and_save(self):
        filename = '{}-{}-{}-{}.png'.format(self.id, self.common_name, self.year, self.artist).replace(' ', '')
        print('filename = ', filename)
        ori_img_url = self._parse_ori_img_url()
        print('original img url = ', ori_img_url)
        resp = requests.get(ori_img_url)
        with open(IMG_FOLDER + filename, 'wb') as f:
            f.write(resp.content)
            print('saved...', filename)

    def _parse_ori_img_url(self) -> str:
        # '../download/POM00007435/thumbnail'.split('/')[2] is 'POM00007435'
        img_id = self.thumb_pic_url.split('/')[2]
        print('img id = ', img_id)
        return 'https://usdawatercolors.nal.usda.gov/download/{}/screen'.format(img_id)

    def __str__(self):
        return 'FruitInfo(artist={},year={},scientific_name={},common_name={},thumb_pic_url={})'.format(
            self.artist, self.year, self.scientific_name, self.common_name, self.thumb_pic_url)


if __name__ == '__main__':
    run()
﻿
Running this locally requires a proxy to be configured; otherwise the USDA website cannot be opened.
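One way to pass a proxy to requests is its proxies argument. The sketch below only builds that mapping; the helper name build_proxies and the address 127.0.0.1:1080 are placeholders invented for this example, so substitute whatever proxy you actually run.

```python
def build_proxies(proxy_url):
    """Return a mapping in the shape accepted by the proxies argument of requests.get."""
    return {'http': proxy_url, 'https': proxy_url}


# Hypothetical local proxy address; replace it with your own.
PROXIES = build_proxies('http://127.0.0.1:1080')
print(PROXIES)

# Usage inside the crawler (not executed here):
# resp = requests.get(url, proxies=PROXIES, timeout=30)
```

Both requests.get calls in the crawler would need the same argument, since the result pages and the image downloads are served by the same site.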
﻿
GitHub
usda-fruit-img-spider
The packaged images.zip, 1.1 GB