Python 如何获取页面上某个元素指定区域的 html 源码？

作者：虫无涯

2023-07-17
陕西
本文字数：2263 字
阅读完需：约 7 分钟

1 需求来源

自动化测试中，有时候需要获取某个元素所在区域的页面源码，用于后续的对比分析或者他用；
另外在 pa chong 中可能需要获取某个元素所在区域的页面源码，然后原格式保存下来，比如保存为 html 或者 excel 格式数据等。

2 测试对象

获取博客园首页右侧的【48 小时阅读排行】词条；
获取博客园首页右侧的【10 天推荐排行】词条。

3 需求实现

3.1 使用 selenium 实现

3.1.1 实现过程

查看博客园首页右侧的【48 小时阅读排行】元素 xpath 属性；

复制其xpath：'//*[@id="side_right"]/div[3]'；
查看博客园首页右侧的【10 天推荐排行】元素 xpath 属性：
复制其xpath：'//*[@id="side_right"]/div[4]'；
使用 selenium 的get_attribute('outerHTML')方法进行这两个元素的outerHTML获取：

3.1.2 源码

# -*- coding:utf-8 -*-# 作者：NoamaNelson# 日期：2022/10/13 # 文件名称：test_selenium_otherHTML.py# 作用：xxx# 联系：VX(NoamaNelson)# 博客：https://blog.csdn.net/NoamaNelson
from selenium import webdriverimport time

content_list = ["content_48_h", "content_10_d"]el_xpath = ['//*[@id="side_right"]/div[3]',            '//*[@id="side_right"]/div[4]']content = []
driver = webdriver.Chrome()driver.get("https://www.cnblogs.com/")time.sleep(2)
for i in range(0, 2):    content_list[i] = driver.find_element_by_xpath(el_xpath[i])    content.append(content_list[i].get_attribute('outerHTML'))
print(f"48小时阅读排行为：{content[0]}",      f"10天推荐排行为：{content[1]}")
time.sleep(2)driver.quit()

复制代码

3.2 使用 requests + lxml.etree 实现

3.2.1 实现过程

同样获取对应的元素的 xapth：

# 48小时阅读排行'//*[@id="side_right"]/div[3]'
# 10天推荐排行'//*[@id="side_right"]/div[4]'

复制代码

先使用requests的get方法进入网站：

res = requests.get('https://www.cnblogs.com/',       verify=False,       headers=headers)

复制代码

使用etree方法解析：

tree = etree.HTML(res.content)

复制代码

找到对应的 xpath，对应的内容：

tree.xpath('//*[@id="side_right"]/div[3]')tree.xpath('//*[@id="side_right"]/div[4]')

复制代码

3.2.2 源码

from lxml import etreeimport requests
content_list = ["content_48_h", "content_10_d"]el_xpath = ['//*[@id="side_right"]/div[3]',            '//*[@id="side_right"]/div[4]']content = []
headers = {'Connection': 'close'}
res = requests.get('https://www.cnblogs.com/', verify=False, headers=headers)tree = etree.HTML(res.content)for i in range(0, 2):    content_list[i] = tree.xpath(el_xpath[i])    print(content_list[i])    content.append(etree.tostring(content_list[i][0], encoding='utf-8'))print(f"48小时阅读排行为：{content[0]},",      f"10天推荐排行为：{content[1]}")

复制代码

运行以上代码后，发现报错了。。。

  File "F:\python_study\test_selenium_otherHTML.py", line 24, in <module>    content.append(etree.tostring(content_list[i][0], encoding='utf-8'))IndexError: list index out of range[]

复制代码

从结果看，发现找到的对应xpath页面的内容为空，那么可以猜测是因为这个https://www.cnblogs.com/下没有对应的'//*[@id="side_right"]/div[3]'或'//*[@id="side_right"]/div[4]'

3.2.3 问题排查

3.2.3.1 获取该网址下的源码

使用 fiddler 抓包https://www.cnblogs.com/下的源码，进行查找我们的关键字【48 小时阅读排行】和【10 天推荐排行】：

复制返回的数据用 vscode 打开后查找以上关键字：

发现没有查找到结果，那么可以证实我们说的https://www.cnblogs.com/下没有对应的'//*[@id="side_right"]/div[3]'或'//*[@id="side_right"]/div[4]'，换言之，我们需要的元素不在这个页面，虽然我们但从网页看是在同一页面，但可能是其他页面加载出来的。所以我们得找到这个原色所在的页面，重新进行定位。

3.2.3.2 使用 fiddler 找该元素所在网页和属性

打开 fiddler 后，我们继续访问https://www.cnblogs.com/；
往下看，找到接口https://www.cnblogs.com/aggsite/SideRight后，发现返回值里边有我们需要的关键字，那么这个接口地址才是我们需要的，而不是https://www.cnblogs.com/；

我们复制接口https://www.cnblogs.com/aggsite/SideRight的返回值到 vscode 中，并进行运行：

可以看到我们需要的关键字就在以上接口中，所以先确定好我们所需要的关键字的请求接口为：https://www.cnblogs.com/aggsite/SideRight；
然后我们从以上运行的页面中，获取真正的【48 小时阅读排行】和【10 天推荐排行】的元素的属性（xpath）。如下：

# 48小时阅读排行'/html/body/div[1]/ul',
# 10天推荐排行'/html/body/div[2]/ul'

复制代码

3.2.4 修正后的源码

from lxml import etreeimport requests
content_list = ["content_48_h", "content_10_d"]el_xpath = ['/html/body/div[1]/ul',            '/html/body/div[2]/ul']content = []
headers = {'Connection': 'close'}
res = requests.get('https://www.cnblogs.com/aggsite/SideRight', verify=False, headers=headers)tree = etree.HTML(res.content)for i in range(0, 2):    content_list[i] = tree.xpath(el_xpath[i])    print(content_list[i])    content.append(etree.tostring(content_list[i][0], encoding='utf-8'))print(f"48小时阅读排行为：{content[0]},",      f"10天推荐排行为：{content[1]}")

复制代码

再次运行以上代码，OK 了。

发布于: 刚刚阅读数: 5

原文链接:【http://xie.infoq.cn/article/96ca840e0e8eb48571d1d656d】。文章转载请联系作者。

虫无涯

关注

专注测试领域各种技术研究、分享和交流~ 2019-12-11 加入

CSDN测试领域优质创作者 | CSDN博客专家 | 阿里云专家博主 | 华为云享专家 | 51CTO专家博主

发布

暂无评论

创作场景

Python 如何获取页面上某个元素指定区域的 html 源码？

1 需求来源

2 测试对象

3 需求实现

3.1 使用 selenium 实现

3.1.1 实现过程

3.1.2 源码

3.2 使用 requests + lxml.etree 实现

3.2.1 实现过程

3.2.2 源码

3.2.3 问题排查

3.2.3.1 获取该网址下的源码

3.2.3.2 使用 fiddler 找该元素所在网页和属性

3.2.4 修正后的源码

虫无涯

评论