Haste makes waste

Python Web Scraping - [Extended] Beautiful Soup 4.4.0 Documentation

Posted on By lijun

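For a more detailed description of the Beautiful Soup library, see the Beautiful Soup 4.4.0 documentation and its Japanese edition.

References:

  1. 莫烦Python (Morvan Python): BeautifulSoup parsing

My upcoming work will probably involve a lot of HTML parsing, so I need to know this library inside out.

I am considering building a Django site that extracts Q&A lists from specified websites and supports editing and CSV download, similar to DX Suite.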

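0. Preparation

The official documentation is long, so I organized it into the following mind map:

[Image: mind map of the Beautiful Soup documentation]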

import requests
from bs4 import BeautifulSoup
url = "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/cloth_mask_qa_.html"
r = requests.get(url)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text,"html.parser")

1. Quick Start

  • Define an HTML document:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,"lxml")

print(soup.prettify())

The printed output looks like this:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

1.1 Common operations

print("1:",soup.title)
print("1:",type(soup.title))

print("\n2:",soup.title.next) # a bare string, which is the child node of title
print("2:",type(soup.title.next))

print("\n3:",soup.title.parent)
print("3:",soup.title.parent.name)

Output:

1: <title>The Dormouse's story</title>
1: <class 'bs4.element.Tag'>

2: The Dormouse's story
2: <class 'bs4.element.NavigableString'>

3: <head><title>The Dormouse's story</title></head>
3: head

  • Locating and searching:
print("1:",soup.p)
print("1:",soup.p['class'])

print("\n2:",soup.p.next)
print("2:",soup.p.next.next)

print("\n3:",soup.find_all('a'))

print("\n4:",soup.find(id="link3")['href'])

Output:

1: <p class="title"><b>The Dormouse's story</b></p>
1: ['title']

2: <b>The Dormouse's story</b>
2: The Dormouse's story

3: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4: http://example.com/tillie

Note how href is retrieved in the loop below: link is one element of the list returned by find_all(), and its href can be fetched with the get() method.

for link in soup.find_all('a'):
    print(link.get('href'))

Output:

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Get all the text in the document:

print(soup.get_text()[0:200])

Output:

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.

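2. Kinds of objects

Beautiful Soup transforms a complex HTML document into a complex tree in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.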
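As a quick sketch of how the four kinds show up in practice, the snippet below parses a small fragment and records the class of each node. The markup and the variable names (`sample`, `kinds_soup`, `kinds`) are assumptions made up for illustration, and the built-in html.parser is used so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: a string, a comment, and a nested tag inside <p>.
sample = "<p>Hello<!-- a note --><b>world</b></p>"
kinds_soup = BeautifulSoup(sample, "html.parser")

# The parsed document itself is a BeautifulSoup object, and <p> is a Tag.
print(type(kinds_soup).__name__)
print(type(kinds_soup.p).__name__)

# Record the class name of each direct child of <p>, in document order.
kinds = [type(node).__name__ for node in kinds_soup.p.children]
print(kinds)
```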

2.1 Tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','lxml')
tag = soup.b
print("1:",type(tag))
print("2:",tag)

Output:

1: <class 'bs4.element.Tag'>
2: <b class="boldest">Extremely bold</b>

The two most important features of a Tag are its name and its attributes.

name

print("1:",tag.name)
print("1:",tag)
tag.name = "blockquote"
print("2:",tag.name)
print("2:",tag)
tag.name = "b"

The name of a tag can be reassigned, as the output shows:

1: b
1: <b class="boldest">Extremely bold</b>
2: blockquote
2: <blockquote class="boldest">Extremely bold</blockquote>

attributes

print("1:",tag.attrs)
print("2:",tag['class'])

Output:

1: {'class': ['boldest']}
2: ['boldest']

A tag's attributes behave like a dictionary: you can read them by key, and add, remove, or modify them.

tag['class'] = 'verybold'
tag['id'] = 1
print("1:",tag)
print("2:",tag.attrs)

Output:

1: <b class="verybold" id="1">Extremely bold</b>
2: {'class': 'verybold', 'id': 1}

Deleting an attribute:

del tag['id']
print("1:",tag)

Output:

1: <b class="verybold">Extremely bold</b>

When an attribute has multiple values:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "lxml")
print('1:',css_soup.p['class'])

css_soup = BeautifulSoup('<p class="body"></p>', "lxml")
print('2:',css_soup.p['class'])

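Output:

1: ['body', 'strikeout']
2: ['body']

If an attribute looks like it has multiple values but is not defined as multi-valued in any version of the HTML standard, Beautiful Soup returns it as a single string:

id_soup = BeautifulSoup('<p id="my id"></p>', "lxml")
id_soup.p['id']

Output: 'my id'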
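Multi-valued attributes also work in the other direction: assigning a list to such an attribute joins the values with spaces when the tag is serialized. The sketch below uses a made-up `rel` example; the variable names (`rel_soup`, `link`, `first`) are assumptions for illustration, and html.parser stands in for lxml.

```python
from bs4 import BeautifulSoup

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', "html.parser")
link = rel_soup.a

# rel is a multi-valued attribute, so it is returned as a list.
first = link["rel"]
print(first)

# Assigning a list stores several values; they are joined on output.
link["rel"] = ["index", "contents"]
print(rel_soup.p)
```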

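2.2 NavigableString

Beautiful Soup uses the NavigableString class to wrap the strings inside a tag:

print("1:",tag.string)
print("2:",type(tag.string))

Output:

1: Extremely bold
2: <class 'bs4.element.NavigableString'>

A string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method:

tag.string.replace_with("No longer bold")
print("1:",tag)

Output:

1: <b class="boldest">No longer bold</b>

In addition, a NavigableString supports most of the features used for navigating and searching the tree. Because a string cannot contain anything else, however, strings do not support the .contents or .string attributes, or the find() method.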
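A NavigableString is also an ordinary Python string with tree links attached. The following sketch, with assumed variable names (`bold_soup`, `nav`, `plain`), shows that it can be converted to a plain `str` when the text is needed outside Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

bold_soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")
nav = bold_soup.b.string

# A NavigableString is a str subclass that remembers its place in the tree.
print(isinstance(nav, NavigableString), isinstance(nav, str))
print(nav.parent.name)

# str() yields a plain Python string that no longer references the parse tree.
plain = str(nav)
print(type(plain) is str, plain)
```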

2.3 Comments and other special strings

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,'lxml')
comment = soup.b.string
print('1:',type(comment))

Output:

1: <class 'bs4.element.Comment'>
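A Comment is a special kind of NavigableString. As a small sketch (the names `comment_soup` and `note` are assumptions, and html.parser is used instead of lxml), the comment text is available without its delimiters, while serializing the tag puts them back:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
comment_soup = BeautifulSoup(markup, "html.parser")
note = comment_soup.b.string

# Comment is a subclass of NavigableString.
print(isinstance(note, Comment), isinstance(note, NavigableString))

# The string value holds only the text between <!-- and -->.
print(str(note))

# Serializing the enclosing tag restores the comment delimiters.
print(comment_soup.b)
```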

2.4 BeautifulSoup

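The BeautifulSoup object represents the document as a whole. Most of the time you can treat it as a Tag object: it supports most of the methods for navigating and searching the tree.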
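A minimal sketch of that behaviour, using a made-up one-line document (the names `doc_soup` and `title_text` are assumptions): the BeautifulSoup object is a Tag subclass, but because it does not correspond to a real HTML tag it carries the special name "[document]".

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

doc_soup = BeautifulSoup("<html><head><title>T</title></head></html>", "html.parser")

# The document object has no real tag behind it, hence the placeholder name.
print(doc_soup.name)

# It still behaves like a Tag, so the usual navigation and search calls work.
print(isinstance(doc_soup, Tag))
title_text = doc_soup.find("title").string
print(title_text)
```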

3. Navigating the tree

Take the following HTML as an example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

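3.1 Child nodes

A Tag may contain multiple strings or other Tags.

Navigating by tag name: the simplest way to navigate the tree is to name the tag you want, for example:

print("1:",soup.head)
print("2:",soup.title)

Output:

1: <head><title>The Dormouse's story</title></head>
2: <title>The Dormouse's story</title>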

You can apply this dot access repeatedly to drill down into the tree. For example, soup.body.b gets the first <b> tag inside <body>.

Note: dot access returns only the first tag with that name. To get all the <a> tags, use a search method such as soup.find_all('a').

# Get the first <b> tag inside <body>:
print("1:",soup.body.b)
# Get the first tag with the given name:
print("2:",soup.a)
# Get all the <a> tags:
print("3:",soup.find_all('a'))

Output:

1: <b>The Dormouse's story</b>
2: <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
3: [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

[Important] .contents and .children

A tag's .contents attribute returns the tag's children as a list:

head_tag = soup.head
print("1:",head_tag)

# Note: .contents returns a list, so title_tag is a list holding the <title> tag
title_tag = head_tag.contents
print("2:",title_tag)

print("3:",title_tag[0].contents)

Output:

1: <head><title>The Dormouse's story</title></head>
2: [<title>The Dormouse's story</title>]
3: ["The Dormouse's story"]

With a tag's .children generator, you can loop over the tag's children:

for child in title_tag[0].children:
    print(child)

Output:

The Dormouse's story

.descendants

The .contents and .children attributes above cover only a tag's direct children. To include grandchildren and deeper nodes, use .descendants:

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

for child in soup.head.children:
    print("1. **soup.head.children**: ",child)

for child in soup.head.descendants:
    print("2. **soup.head.descendants**: ",child)

Output; note that the second loop also prints the grandchild node:

1. **soup.head.children**:  <title>The Dormouse's story</title>
2. **soup.head.descendants**:  <title>The Dormouse's story</title>
2. **soup.head.descendants**:  The Dormouse's story

.string

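A tag's .string is defined in the following cases:

  • If a tag has only one child and that child is a NavigableString, .string returns that string.
  • If a tag's only child is another tag that has a .string, the parent's .string is the same as the child's.

Note the difference between .text and .string: .text returns all the text inside a tag, that is, every NavigableString joined together, whereas .string is None when a tag contains more than one child.

For example:

from bs4 import BeautifulSoup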

html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print("1. soup.text:\n",soup.text)
print("2. soup.string:\n",soup.string)
print("3. soup.head.string:\n",soup.head.string)
print("4. soup.head.title.string:\n",soup.head.title.string)

Output:

1. soup.text:
 
The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

2. soup.string:
 None

3. soup.head.string:
 The Dormouse's story
4. soup.head.title.string:
 The Dormouse's story

.strings and .stripped_strings
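A minimal sketch of the two generators named in this heading, using a made-up fragment (the names `snippet`, `s_soup`, `raw`, and `clean` are assumptions): .strings yields every string inside a tag as-is, while .stripped_strings trims whitespace and skips strings that are whitespace only.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with extra whitespace around the text.
snippet = "<p>  Once upon a time <a>Elsie</a>,\n <a>Lacie</a>  </p>"
s_soup = BeautifulSoup(snippet, "html.parser")

# .strings keeps leading and trailing whitespace, including blank strings.
raw = list(s_soup.p.strings)
print(raw)

# .stripped_strings strips each string and drops the ones left empty.
clean = list(s_soup.p.stripped_strings)
print(clean)
```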

3.2 Parent nodes

.parent

.parents
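As a rough illustration of these two attributes (the fragment and the names `p_soup`, `link`, and `ancestors` are assumptions), .parent gives the enclosing element and .parents walks all the way up to the document object:

```python
from bs4 import BeautifulSoup

p_soup = BeautifulSoup("<html><body><p><a>Elsie</a></p></body></html>", "html.parser")
link = p_soup.a

# .parent is the element that directly contains the tag.
print(link.parent.name)

# .parents iterates over every ancestor, ending at the BeautifulSoup object.
ancestors = [parent.name for parent in link.parents]
print(ancestors)
```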

3.3 Sibling nodes

.next_sibling and .previous_sibling

.next_siblings and .previous_siblings
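A small sketch of sibling navigation on a made-up fragment (the names `sib_soup`, `after_b`, and `before_d` are assumptions). Siblings are nodes that share the same parent; the first child has no previous sibling, so the attribute is None there.

```python
from bs4 import BeautifulSoup

sib_soup = BeautifulSoup("<a><b>text1</b><c>text2</c><d>text3</d></a>", "html.parser")

# Single steps to the right and to the left.
print(sib_soup.b.next_sibling)
print(sib_soup.c.previous_sibling)
print(sib_soup.b.previous_sibling)  # <b> is the first child

# The plural forms iterate over all remaining siblings in that direction.
after_b = [tag.name for tag in sib_soup.b.next_siblings]
before_d = [tag.name for tag in sib_soup.d.previous_siblings]
print(after_b, before_d)
```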

3.4 Going back and forth

.next_element and .previous_element

.next_elements and .previous_elements
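The difference from sibling navigation is that these attributes follow parse order rather than tree level. In the sketch below (the fragment and the names `nav_soup`, `b_tag`, and `following` are assumptions), the element after <b> is the string inside it, whereas its next sibling is <c>.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

nav_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
b_tag = nav_soup.b

# The next sibling is on the same level; the next element is whatever was parsed next.
print(repr(b_tag.next_sibling))
print(repr(b_tag.next_element))

# Walk forward through everything parsed after <b>: tags by name, strings as text.
following = [el.name if isinstance(el, Tag) else str(el) for el in b_tag.next_elements]
print(following)
```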

4. Searching the tree

4.1 Filters

A regular expression

A list

True

A function
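The four filter kinds listed above can each be passed to find_all(). The following is only a sketch on a made-up document; the markup, the helper `has_class_but_no_id`, and the result names are assumptions for illustration.

```python
import re
from bs4 import BeautifulSoup

doc = ('<html><head><title>Story</title></head><body>'
       '<p class="title"><b>Bold</b></p>'
       '<a class="sister" id="link1">Elsie</a></body></html>')
f_soup = BeautifulSoup(doc, "html.parser")

# Regular expression: tag names that start with "b".
by_regex = [t.name for t in f_soup.find_all(re.compile("^b"))]

# List: tags whose name matches any item in the list.
by_list = [t.name for t in f_soup.find_all(["a", "b"])]

# True: every tag, but no strings.
by_true = [t.name for t in f_soup.find_all(True)]

# Function: called with each tag; the tag is kept when it returns True.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_func = [t.name for t in f_soup.find_all(has_class_but_no_id)]

print(by_regex, by_list, by_true, by_func)
```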

4.2 find_all()

The name argument

Keyword arguments

Searching by CSS class

The string argument

The limit argument

The recursive argument
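A compact sketch that exercises each of the find_all() arguments listed above on a cut-down version of the "three sisters" markup. The variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

story = ('<p class="story">'
         '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
         '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
         '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
         '</p>')
q_soup = BeautifulSoup(story, "html.parser")

by_name = q_soup.find_all("a")                    # name argument
by_keyword = q_soup.find_all(id="link2")          # keyword argument filters on an attribute
by_class = q_soup.find_all("a", class_="sister")  # class_ avoids the reserved word "class"
by_string = q_soup.find_all(string="Elsie")       # matches strings instead of tags
limited = q_soup.find_all("a", limit=2)           # stop after two results

# recursive=False looks at direct children only.
top_level = q_soup.find_all("a", recursive=False)   # <a> is not a direct child of the document
direct = q_soup.p.find_all("a", recursive=False)    # but it is a direct child of <p>

print(len(by_name), by_keyword[0]["href"], len(by_class), by_string, len(limited))
print(top_level, len(direct))
```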

4.3 Other find functions

Calling a tag is like calling find_all()

find()

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

find_all_next() and find_next()

find_all_previous() and find_previous()

4.4 CSS selectors
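For CSS selectors, Beautiful Soup offers select() and select_one(). The sketch below assumes a cut-down version of the "three sisters" markup and made-up variable names; recent Beautiful Soup releases delegate selector matching to the soupsieve package, which is normally installed alongside bs4.

```python
from bs4 import BeautifulSoup

story = ('<p class="story">'
         '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
         '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
         '<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>'
         '</p>')
css_soup = BeautifulSoup(story, "html.parser")

# Child combinator plus class selectors: returns a list of matching tags.
sisters = css_soup.select("p.story > a.sister")

# select_one() returns only the first match (or None if nothing matches).
second = css_soup.select_one("#link2")

# Attribute selector: href values ending in "tillie".
tillie = css_soup.select('a[href$="tillie"]')

print(len(sisters), second.get_text(), tillie[0]["id"])
```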

5. Modifying the tree

Changing tag names and attributes

Modifying .string

append()

insert()

insert_before() and insert_after()

clear()

extract()

decompose()

replace_with()

wrap()

unwrap()
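A short sketch that chains a few of the modification methods listed in this section on a made-up fragment (the names `m_soup` and `removed` are assumptions). It covers only .string assignment, append(), extract(), and unwrap(); the other methods are not shown.

```python
from bs4 import BeautifulSoup

m_soup = BeautifulSoup("<p><b>bold</b> and <i>italic</i></p>", "html.parser")

# Assigning to .string replaces the tag's contents with the new string.
m_soup.b.string = "strong"

# append() adds a node at the end of the tag's contents.
m_soup.p.append("!")

# extract() removes a node from the tree and returns it.
removed = m_soup.i.extract()

# unwrap() replaces a tag with whatever is inside it.
m_soup.b.unwrap()

print(m_soup.p)
print(removed)
```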

6. Output

6.1 Pretty-printing

6.2 Non-pretty printing

6.3 Output formatters

6.4 get_text()
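Besides the plain call used earlier in this post, get_text() accepts a separator and a strip flag. A minimal sketch on a made-up link (the names `g_soup`, `plain`, `joined`, and `stripped` are assumptions):

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
g_soup = BeautifulSoup(markup, "html.parser")

# Default: all strings concatenated, whitespace preserved.
plain = g_soup.get_text()

# A separator string is placed between the individual strings.
joined = g_soup.get_text("|")

# strip=True trims whitespace from each string before joining.
stripped = g_soup.get_text("|", strip=True)

print(repr(plain), repr(joined), repr(stripped))
```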

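7. Specifying the parser to use

Differences between parsers

8. Encodings

8.1 Output encoding

8.2 Unicode, Dammit

Smart quotes

Inconsistent encodings

9. Miscellaneous

Comparing objects for equality, copying Beautiful Soup objects, parsing only part of a document, SoupStrainer

10. Troubleshooting

Diagnosing code, errors when parsing a document, version mismatch problems, parsing XML, other parser problems, miscellaneous errors, improving performance

11. Beautiful Soup 3

Porting code to BS4, required parsers, method name changes, generators, XML, entities, miscellaneous porting notes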