Using BeautifulSoup4 in Python (Scraping Web Page Information)

Author: 度假的鱼🐟

2022-11-28
Beijing

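1. Introduction to BeautifulSoup4

BeautifulSoup4 is an essential library for web scraping. It makes parsing pages fetched with requests far simpler, so you no longer have to rack your brains over which regular expression will match the content you need. (Regular expressions are efficient, but they are easy to get lost in.)

This article introduces the basic operations in BeautifulSoup4 with small examples. It shows what the library is good for, how it works, how to use it, and how to get the results you want.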

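1.1 What Is BeautifulSoup4

BeautifulSoup4 is the fourth major version of the Beautiful Soup project and the current one.

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the parsed document. It can save you hours or even days of work.

Beautiful Soup's support for Python 2 ended on December 31, 2020; new Beautiful Soup development targets Python 3 exclusively. The final Beautiful Soup 4 release to support Python 2 is 4.9.3.

An HTML document is structured text that follows fixed rules, and that structure can be used to simplify information extraction. This is why libraries such as lxml, pyquery, and BeautifulSoup exist, and they are what we normally use to extract information from web pages. Of these, lxml parses very efficiently and supports XPath (a query syntax for locating nodes in a document); pyquery takes its name from jQuery (the well-known front-end JavaScript library) and lets you parse pages with jQuery-like syntax. The subject of this article is the remaining one: BeautifulSoup.

BeautifulSoup (abbreviated "bs" below) takes its unusual name from Alice's Adventures in Wonderland, which is also why its official site carries odd illustrations and uses passages from Alice as test text.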

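1.2 Before You Start: A Review of Trees in Data Structures

Let's briefly review the basics of trees from data structures, so that you have a picture of a tree in mind.

Node concepts

Node: each data element in the diagram is called a node.

Degree of a node: the number of subtrees a node has. In the diagram, node A has three subtrees, so its degree is 3.

Root node: every non-empty tree has exactly one node called the root. In the diagram, A is the root of the tree.

Child, parent, and sibling nodes: the root of a node's subtree is called a child of that node, and that node is the child's parent. Children of the same parent are siblings of one another. In the diagram, B, C, and D are siblings and are also children of A; C is the parent of G.

Leaf node: a node whose degree is 0, also called a terminal node. In the diagram, K and M are examples of leaf nodes.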
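To make these terms concrete, here is a minimal sketch of a tree in plain Python. Its shape is an assumption: it only respects the facts stated above (A is the root with children B, C, and D; C is the parent of G; K and M are leaves). The `Node` class and the placement of K and M under G are hypothetical and chosen purely for illustration.

```python
class Node:
    """A tree node that remembers its parent and its children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add(self, name):
        """Create a child node, attach it to this node, and return it."""
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    @property
    def degree(self):
        """Degree = number of subtrees = number of children."""
        return len(self.children)

    @property
    def is_leaf(self):
        """A leaf (terminal) node has degree 0."""
        return self.degree == 0

    def siblings(self):
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]


root = Node("A")                                  # root node
b, c, d = root.add("B"), root.add("C"), root.add("D")
g = c.add("G")                                    # C is the parent of G
k, m = g.add("K"), g.add("M")                     # hypothetical placement of the leaves

print(root.degree)                       # 3
print([n.name for n in c.siblings()])    # ['B', 'D']
print(g.parent.name)                     # C
print(k.is_leaf, m.is_leaf)              # True True
```

The same vocabulary (parent, children, siblings) reappears as attribute names when navigating a parsed HTML document.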

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <link rel="stylesheet" type="text/css" href="style.css">
        <script type="application/javascript" src="script.js"></script>
        <title>I’m the title</title>
    </head>
    <body>
        <h1>HelloWorld</h1>
        <div>
            <div>
                <p>picture:</p>
                <img src="example.png"/>
            </div>
            <div>
                <p>A paragraph of explanatory text...</p>
            </div>
        </div>
    </body>
</html>


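Parsing the HTML source above builds a DOM tree in which every tag is a node: html is the root, head and body are its children, and the nesting continues down to the innermost p and img tags.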
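One way to see that tree without a diagram is to parse the source and print each tag indented by its depth. This is only a sketch: it assumes the `beautifulsoup4` package is already installed, the `outline` helper is a hypothetical name introduced here, and the HTML is a shortened copy of the example above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8">
        <title>I'm the title</title>
    </head>
    <body>
        <h1>HelloWorld</h1>
        <div>
            <div>
                <p>picture:</p>
                <img src="example.png"/>
            </div>
            <div>
                <p>A paragraph of explanatory text...</p>
            </div>
        </div>
    </body>
</html>"""


def outline(node, depth=0, lines=None):
    """Collect tag names in document order, indented two spaces per level."""
    lines = [] if lines is None else lines
    for child in node.children:
        # Skip text nodes and the doctype; only tags are tree nodes here.
        if isinstance(child, Tag):
            lines.append("  " * depth + child.name)
            outline(child, depth + 1, lines)
    return lines


soup = BeautifulSoup(html, "html.parser")
tree_lines = outline(soup)
print("\n".join(tree_lines))
```

In the printed outline, `html` sits at the left margin as the root, `head` and `body` are indented one level as its children, and the two inner `div` tags appear as siblings under the outer `div`.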

2. Installing the BeautifulSoup4 Library

# Install BeautifulSoup4
pip install beautifulsoup4


Basic workflow: initialize a bs object from text -> retrieve information with find/find_all or other methods -> output or save the result
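The three steps of this workflow can be sketched as follows. The page text is a made-up literal so that the example runs offline; in a real crawler it would typically come from an HTTP response such as `requests.get(url).text`, which is an assumption and not shown here. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical page text standing in for a downloaded page.
page_text = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li class="book"><a href="/b/1">Alice</a></li>
    <li class="book"><a href="/b/2">Looking-Glass</a></li>
  </ul>
</body></html>
"""

# Step 1: initialize the BeautifulSoup object from text.
soup = BeautifulSoup(page_text, "html.parser")

# Step 2: retrieve information with find / find_all.
heading = soup.find("h1").get_text()
links = [(a.get_text(), a["href"]) for a in soup.find_all("a")]

# Step 3: output (or save) the result.
print(heading)
for title, href in links:
    print(title, href)
```

`find` returns the first matching tag (or `None` if nothing matches), while `find_all` returns a list of every match, which is why the second call is wrapped in a loop.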

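The official documentation is approachable and is also available in Chinese. It is recommended reading:

Official Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Beautiful Soup supports the HTML parser in Python's standard library (html.parser) as well as third-party parsers such as lxml and html5lib. The official documentation includes a table of the main parsers with their advantages and disadvantages.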
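The parser is chosen by the second argument to `BeautifulSoup`. Below is a small sketch of picking a parser with a fallback; it assumes you would prefer lxml when it is installed and otherwise use the built-in `html.parser`. The `make_soup` helper and its `preferred` parameter are hypothetical names, not part of the library.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is available."""
    for parser_name in preferred:
        try:
            return BeautifulSoup(markup, parser_name), parser_name
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Deliberately unclosed tags: the parser repairs them while building the tree.
soup, used = make_soup("<p>Hello<b>world")
print(used)
print(soup.p.get_text())
```

Since `html.parser` ships with Python, the fallback always succeeds. Different parsers may repair invalid markup differently, so it is usually safer to name the parser explicitly than to rely on whatever happens to be installed.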

2.1 Basic Operations by Example

The following piece of HTML is used as the practice example:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>

<p class="story">...</p>"""


Analysis
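As a sketch of how this document can be analyzed, the snippet below reads its structure with the tree vocabulary from section 1.2: parent, children, and siblings. It assumes the built-in `html.parser`; the variable names are illustrative, and the whitespace inside `html_doc` is laid out for readability.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Parent: the <title> tag sits inside <head>.
title_parent = soup.title.parent.name

# Children: direct child tags of <body> (recursive=False stops at one level).
body_children = [tag.name for tag in soup.body.find_all(True, recursive=False)]

# Siblings: the <a> tags that follow the first <a> under the same parent.
later_links = [a["id"] for a in soup.a.find_next_siblings("a")]

# Subtree search: count the <a> tags inside the first "story" paragraph.
story = soup.find("p", class_="story")
link_count = len(story.find_all("a"))

print(title_parent)    # head
print(body_children)   # ['p', 'p', 'p']
print(later_links)     # ['link2', 'link3']
print(link_count)      # 3
```

The keyword is spelled `class_` because `class` is a reserved word in Python.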

2.2 Complete Code Exercise

# Import the library
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p>
<p class="story">...</p>"""

# Parse html_doc; the result is a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the document with standard indentation
print(soup.prettify())

# 1. Get the whole title tag
print("1. The whole title tag:", soup.title)
# 2. Get the name of the title tag
print("2. Name of the title tag:", soup.title.name)
# 3. Get the text of the title tag
print("3. Text of the title tag:", soup.title.string)
# 4. Get the whole head tag
print("4. The whole head tag:", soup.head)
# 5. Get the first p tag
print("5. The first p tag:", soup.p)
# 6. Get the class value of the first p tag
print("6. class of the first p tag:", soup.p["class"])
# 7. Get the first a tag
print("7. The first a tag:", soup.a)
# 8. Get all a tags
print("8. All a tags:", soup.find_all("a"))
# 9. Get the tag with id="link2"
print("9. Tag with id=link2:", soup.find(id="link2"))
# 10. Loop over all a tags and print each href value
for item in soup.find_all("a"):
    print(item.get("href"))
# 11. Loop over all a tags and print each tag's text
for item in soup.find_all("a"):
    print(item.get_text())

Output:

"D:\Program Files1\Python\python.exe" D:/Pycharm-work/pythonTest/打卡/0818-BeautifulSoup4.py
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
1. The whole title tag: <title>The Dormouse's story</title>
2. Name of the title tag: title
3. Text of the title tag: The Dormouse's story
4. The whole head tag: <head><title>The Dormouse's story</title></head>
5. The first p tag: <p class="title"><b>The Dormouse's story</b></p>
6. class of the first p tag: ['title']
7. The first a tag: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
8. All a tags: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
9. Tag with id=link2: <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
Elsie
Lacie
Tillie

Process finished with exit code 0
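Two details in that output are easy to miss: the class value in step 6 is printed as a list, and step 3 reads the text through `.string` while step 11 uses `get_text()`. The sketch below illustrates the difference on a made-up fragment (the markup and variable names are assumptions for illustration only).

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="story intro">Hello <b>big</b> world</p><p id="only">plain</p>',
    "html.parser",
)
first, second = soup.find_all("p")

# class is a multi-valued attribute, so its value is a list of class names.
classes = first["class"]

# Most other attributes, such as id, come back as a plain string.
second_id = second["id"]

# .string only works when a tag has exactly one child;
# the first <p> holds text plus a <b> tag, so .string is None.
first_string = first.string
second_string = second.string

# get_text() joins all text inside the tag, however deeply it is nested.
first_text = first.get_text()

# tag["attr"] raises KeyError for a missing attribute; tag.get("attr") returns None.
missing = first.get("href")

print(classes)         # ['story', 'intro']
print(second_id)       # only
print(first_string)    # None
print(second_string)   # plain
print(first_text)      # Hello big world
print(missing)         # None
```

In practice, `get_text()` is the more forgiving choice for tags with mixed content, and `get()` is the safer way to read an attribute that may be absent.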


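That was a minimal hands-on introduction to BeautifulSoup, and you should now have a first sense of what bs can do. If you plan to use it in real development, read the official documentation as well. It is clearly written and available in Chinese, and reading just the first part is enough to start using it in your code.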


Original link: http://xie.infoq.cn/article/7c124a4744039a10d8fe3124c (reproduction without the author's permission is prohibited)
