Data Parsing - My Tech Blog

Data Parsing

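A focused crawler aims to extract specific pieces of data from an HTML page rather than store the whole page. Data parsing generally follows two steps:

Locate the HTML tag that contains the target data
Extract the tag's text content or attribute values

Python offers three common approaches to parsing: regular expressions, BeautifulSoup, and XPath.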
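
The two steps above can be sketched on a small inline HTML string. This is a minimal illustration, assuming BeautifulSoup (covered later in this post) with Python's built-in html.parser; the markup and the `book` class name are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this illustration
html = '<div class="book"><a href="/b/1">Dune</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Step 1: locate the tag that holds the target data
link = soup.select_one("div.book a")

# Step 2: extract its text content or an attribute value
title = link.get_text(strip=True)
href = link["href"]
print(title, href)
```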

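Regular Expressions

Suited to simple, fixed patterns, but error-prone on complex HTML: when the page structure changes, the pattern usually has to be rewritten.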

import re
import requests

headers = {"User-Agent": "Mozilla/5.0"}
html = requests.get("https://example.com", headers=headers, timeout=10).text

# Extract all image URLs (matches double-quoted src attributes only)
pattern = r'<img[^>]+src="([^"]+)"'
img_urls = re.findall(pattern, html)
print(img_urls[:5])
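
The fragility mentioned above shows up as soon as the markup switches quoting style. The sketch below compares the double-quote-only pattern with a variant that accepts either quote character. The sample HTML is made up, and the tolerant pattern is one possible approach, not a complete HTML-safe solution.

```python
import re

# Hypothetical snippet mixing both quoting styles
html = """<img class="a" src="one.png"><img src='two.png' alt="x">"""

# The double-quote-only pattern misses the single-quoted src
strict = re.findall(r'<img[^>]+src="([^"]+)"', html)

# Accept either quote character; the backreference \1 requires the same closing quote
tolerant = [m.group(2) for m in re.finditer(r'<img[^>]+src=(["\'])(.*?)\1', html)]

print(strict)
print(tolerant)
```

Even the tolerant pattern still ignores unquoted attributes, which is part of why a real parser is usually the safer choice.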

BeautifulSoup (Recommended)

BeautifulSoup 4 is among the most approachable HTML parsing libraries, and it performs better when paired with the lxml parser.

pip install beautifulsoup4 lxml

Basic Usage

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
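resp = requests.get("https://books.toscrape.com/", headers=headers, timeout=10)
resp.raise_for_status()
resp.encoding = "utf-8"   # decode as UTF-8 (as in the XPath example) so characters such as £ are not garbled

# The lxml parser is recommended: fast and tolerant of malformed markup
soup = BeautifulSoup(resp.text, "lxml")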
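
If lxml might not be installed where the script runs, one option is to fall back to the standard-library parser. BeautifulSoup raises `bs4.FeatureNotFound` when the requested parser is unavailable. The helper name `make_soup` is hypothetical, and this is only a sketch of the idea.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup: str) -> BeautifulSoup:
    """Parse with lxml when available, otherwise fall back to html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser needs no extra package
        return BeautifulSoup(markup, "html.parser")

soup = make_soup("<p>First</p><p>Second</p>")
paragraphs = [p.get_text() for p in soup.find_all("p")]
print(paragraphs)
```

The two parsers can build different trees from badly broken markup, so results on messy pages may vary with the parser that ends up being used.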

Locating Tags

# Attribute access returns the first matching tag
title = soup.title              # <title>...</title>
first_h1 = soup.h1              # first h1 on the page

# find: returns the first matching tag, or None if nothing matches
tag = soup.find("article", class_="product_pod")
tag = soup.find("a", href=True)

# find_all: returns every matching tag as a list
articles = soup.find_all("article", class_="product_pod")
links = soup.find_all("a", limit=10)   # at most 10

# select: CSS selectors (recommended for their flexibility)
items = soup.select("article.product_pod")           # tag plus class
prices = soup.select(".price_color")                 # class only
header = soup.select_one("div#page_header")          # id; first match only
nested = soup.select("div.content > ul > li > a")    # child combinators
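
To see how these lookups relate, the sketch below runs them against a small inline document, so it works offline. The markup is hypothetical and only mirrors the class names used above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the class names used above
html = """
<div id="page_header"><h1>Books</h1></div>
<article class="product_pod"><a href="/a">A</a></article>
<article class="product_pod"><a href="/b">B</a></article>
<article class="other"><a>No href</a></article>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("article", class_="product_pod")             # first match or None
via_find_all = soup.find_all("article", class_="product_pod")  # all matches
via_select = soup.select("article.product_pod")                # same elements via CSS
with_href = soup.find_all("a", href=True)                      # only <a> tags that have href
header = soup.select_one("div#page_header")

print(len(via_find_all), len(via_select), len(with_href))
print(first.a["href"], header.h1.get_text())
```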

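Extracting Text and Attributes

for article in soup.select("article.product_pod"):
    # Attribute values use dict-style access; text uses get_text()
    title = article.h3.a["title"]           # attribute value
    price = article.select_one(".price_color").get_text(strip=True)
    rating = article.p["class"][1]          # class is a list; take the second item, e.g. "Three"

    # tag.string: the tag's text, valid only when it has exactly one child text node
    # tag.get_text(): text of all descendants joined (strip=True trims whitespace)
    print(f"{title}: {price} (rating: {rating})")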
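
The difference between `tag.string` and `tag.get_text()` is easiest to see on a tag with mixed children. A small sketch on invented markup, which also shows the list-valued `class` attribute and the safer `.get()` lookup for attributes that may be absent:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<p class="star-rating Three">Rated <b>three</b> stars</p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

print(p.b.string)               # the <b> tag has a single text child
print(p.string)                 # None, because <p> has several children
print(p.get_text())             # text of all descendants joined together
print(p["class"])               # class is multi-valued, so it comes back as a list
print(p.get("data-missing"))    # .get() returns None instead of raising KeyError
```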

Full Example: Scraping a Book List

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

results = []

for page in range(1, 4):   # scrape the first 3 pages
    url = base_url.format(page) if page > 1 else "https://books.toscrape.com/"
    resp = requests.get(url, headers=headers, timeout=10)
    resp.raise_for_status()
    resp.encoding = "utf-8"   # decode as UTF-8 so the £ in prices is not garbled

    soup = BeautifulSoup(resp.text, "lxml")

    for article in soup.select("article.product_pod"):
        results.append({
            "title": article.h3.a["title"],
            "price": article.select_one(".price_color").get_text(strip=True),
            "availability": article.select_one(".availability").get_text(strip=True),
        })

print(f"Scraped {len(results)} books")
for book in results[:5]:
    print(book)
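
In the loop above, `select_one` returns None when an element is missing, and the chained `.get_text()` then raises AttributeError. One way to make the extraction more forgiving is a small helper. The name `text_or_none` is hypothetical and the markup below is invented, so treat this as a sketch.

```python
from bs4 import BeautifulSoup

def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None

# Hypothetical markup: the second article has no availability element
html = """
<article class="product_pod"><p class="price_color">£10.00</p><p class="availability"> In stock </p></article>
<article class="product_pod"><p class="price_color">£12.50</p></article>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [
    {
        "price": text_or_none(article, ".price_color"),
        "availability": text_or_none(article, ".availability"),
    }
    for article in soup.select("article.product_pod")
]
print(rows)
```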

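XPath (lxml)

XPath (XML Path Language) locates nodes with path expressions. It is precise on well-structured HTML and is widely used with Scrapy.

pip install lxml

Basic Syntax

Expression	Meaning
`//div`	div tags anywhere in the document
`/html/body/div`	Absolute path from the root
`//div[@class="box"]`	Match by attribute (exact value)
`//div[contains(@class, "item")]`	Partial attribute match
`//li[1]`	Positional index (1-based): each li that is the first li child of its parent
`//a/text()`	Direct child text
`//div//text()`	All descendant text
`//a/@href`	Attribute value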
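
The expressions in the table can be tried offline on an inline document. This is a minimal sketch with made-up markup. Note that the exact-match form finds nothing for `class="item featured"`, while `contains()` does.

```python
from lxml import etree

# Hypothetical markup for illustration
html = """
<div class="box"><ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul></div>
<div class="item featured">Intro <span>nested</span></div>
"""
tree = etree.HTML(html)

hrefs = tree.xpath('//div[@class="box"]//a/@href')                # attribute values
first_li = tree.xpath('//li[1]/a/text()')                         # first <li> under its parent
exact = tree.xpath('//div[@class="item"]')                        # exact match on the whole attribute
direct = tree.xpath('//div[contains(@class, "item")]/text()')     # direct text nodes only
nested = tree.xpath('//div[contains(@class, "item")]//text()')    # all descendant text nodes

print(hrefs, first_li, exact, direct, nested)
```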

Usage Example

import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get("https://books.toscrape.com/", headers=headers, timeout=10)
resp.raise_for_status()
resp.encoding = "utf-8"

# Build an element tree from the HTML string
tree = etree.HTML(resp.text)

# Global lookup from the document root
articles = tree.xpath('//article[@class="product_pod"]')

for article in articles:
    # Local lookup: a leading . means "relative to the current node"
    title_nodes = article.xpath('.//h3/a/@title')
    price_nodes = article.xpath('.//p[@class="price_color"]/text()')

    if title_nodes and price_nodes:
        print(f"{title_nodes[0]}: {price_nodes[0].strip()}")
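
The leading `.` in the local lookups matters: without it, an expression that starts with `//` searches the whole document again, even when called on a child element. A small offline sketch with invented markup:

```python
from lxml import etree

# Hypothetical markup for illustration
html = """
<article class="product_pod"><h3><a title="Book A">A</a></h3></article>
<article class="product_pod"><h3><a title="Book B">B</a></h3></article>
"""
tree = etree.HTML(html)
second = tree.xpath('//article[@class="product_pod"]')[1]

relative = second.xpath('.//h3/a/@title')   # scoped to this article
absolute = second.xpath('//h3/a/@title')    # starts from the document root again
print(relative)
print(absolute)
```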

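Parsing a Local HTML File

from lxml import etree

# parse() defaults to a strict XML parser; pass HTMLParser so ordinary HTML does not raise a syntax error
tree = etree.parse("local.html", etree.HTMLParser())
results = tree.xpath('//div[@class="item"]/text()')
print(results)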
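
To try local-file parsing without preparing a file by hand, the sketch below writes a deliberately sloppy HTML file to a temporary directory and parses it with `etree.HTMLParser()`. The file content is invented for the example.

```python
import tempfile
from pathlib import Path

from lxml import etree

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "local.html"
    # Sloppy on purpose: unclosed <br>, missing closing tags for body and html
    path.write_text(
        '<html><body><div class="item">first<br>line</div><div class="item">second</div>',
        encoding="utf-8",
    )

    # HTMLParser tolerates markup that a strict XML parser would reject
    tree = etree.parse(str(path), etree.HTMLParser())
    items = tree.xpath('//div[@class="item"]/text()')

print(items)
```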

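Comparison of the Three Approaches

Approach	Pros	Cons	Best for
Regex	No third-party dependency, flexible	Hard to maintain on complex structures	Simple, fixed patterns
BeautifulSoup	Friendly API, tolerant of bad markup	Somewhat slower	Most web scraping
XPath	Precise, fast	More complex syntax	Well-structured HTML, Scrapy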
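
For a side-by-side feel of the three approaches, the sketch below extracts the same link targets from one inline snippet in each style. The markup is hypothetical, and on input this simple all three agree. The differences in the table show up on larger, messier pages.

```python
import re

from bs4 import BeautifulSoup
from lxml import etree

# Hypothetical markup for illustration
html = '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'

by_regex = re.findall(r'<a[^>]+href="([^"]+)"', html)
by_bs4 = [a["href"] for a in BeautifulSoup(html, "html.parser").select("a[href]")]
by_xpath = etree.HTML(html).xpath("//a/@href")

print(by_regex, by_bs4, by_xpath)
```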

Last updated 2026-06-23
