
Python Web Scraping - 02 - Extraction - The Beautiful Soup Library (supplement: 莫烦Python)

Posted by lijun

  1. Reference online course: Python网络爬虫与信息提取 (Python Web Crawling and Information Extraction)
  2. YouTube: a complete 6-hour introductory Python web-scraping course (2020)
  3. For more detail on Beautiful Soup, see the Beautiful Soup 4.4.0 documentation
  4. How to use Python inside UiPath
  5. Inoue-san: library development for RPA integration

The division of labor between the Beautiful Soup and Requests libraries is shown in the figure below:

[image: division of labor between Requests and Beautiful Soup]

A lot of page content needs to be parsed in the upcoming work; this library gets heavy use, so it is worth mastering.

1. Getting started with the Beautiful Soup library

1.1 Installation:

(base) C:\Users\jun.li>pip install beautifulsoup4

The following prints the HTML source of a demo page:

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
kv = {'user-agent':'Mozilla/5.0'}
try:
    r = requests.get(url,headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    demo = r.text
    
    # parse with Beautiful Soup
    soup = BeautifulSoup(demo,"html.parser")
    print(soup.prettify())
except:
    print("NG")

The printed result:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

1.2 Basic elements of the Beautiful Soup library

Basic element: description
Tag: a tag, the basic unit of information, opened with <> and closed with </>
Name: the tag's name; the name of <p>..</p> is p; accessed as <tag>.name
Attributes: the tag's attributes, organized as a dict; accessed as <tag>.attrs
NavigableString: the non-attribute string inside a tag (the "..." in <p>...</p>); accessed as <tag>.string
Comment: a comment within a tag's string, a special Comment type
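
The four elements above can be seen on a tiny inline document (a minimal sketch; the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# a made-up snippet just to exercise the element types
soup = BeautifulSoup('<p class="title" id="p1"><b>Hello</b></p>', "html.parser")

tag = soup.p              # Tag
print(tag.name)           # -> p
print(tag.attrs)          # -> {'class': ['title'], 'id': 'p1'}
print(tag.b.string)       # -> Hello (a NavigableString)
```

Note that class is multi-valued in HTML, so bs4 stores it as a list inside the attrs dict.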

Example:

<p class="title">...</p>
  1. The whole thing is the tag
  2. p is the tag's name
  3. class="title" is an attribute (attrs); a tag can have several, so they are stored as a dict
  4. The ... is the non-attribute string (NavigableString); a comment could also appear there

Example:

  • Make a soup from the HTML, extract the p tag, and assign it to tag
import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
kv = {'user-agent':'Mozilla/5.0'}
try:
    r = requests.get(url,headers = kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    demo = r.text
    
    # parse with Beautiful Soup
    soup = BeautifulSoup(demo,"html.parser")
    tag = soup.p  # returns the first p tag
except:
    print("NG")
  • Read various pieces of information from the p tag:
print("1:",tag)
print("2:",tag.name)
print("3:",tag.attrs)
print("4:",tag.string)
print("5:",tag.parent.name)

The printed result:

1: <p class="title"><b>The demo python introduces several python courses.</b></p>
2: p
3: {'class': ['title']}
4: The demo python introduces several python courses.
5: body
  • string vs. Comment
newsoup = BeautifulSoup("<b><!--This is a b tag comment --></b><p>This is not a comment</p>","html.parser")

print("1:",newsoup.b.string)
print("2:",type(newsoup.b.string))

print("3:",newsoup.p.string)
print("4:",type(newsoup.p.string))

The output:

1: This is a b tag comment 
2: <class 'bs4.element.Comment'>
3: This is not a comment
4: <class 'bs4.element.NavigableString'>
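
Since .string prints identically for both types, an isinstance check is what actually tells a Comment apart from ordinary text (a small sketch on made-up HTML):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<b><!--hidden--></b><p>visible</p>", "html.parser")

# classify each tag's string by its concrete type
for tag in soup.find_all(True):
    kind = "comment" if isinstance(tag.string, Comment) else "text"
    print(tag.name, kind, tag.string)
```

This is the usual way to skip comments when you only want visible text.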

1.3 Traversing HTML content with the bs4 library

A reminder of demo.html's structure:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Traversal of the tag tree goes in three directions: downward, upward, and sideways (between siblings):

[images: tag-tree traversal diagrams]

Example; first make the soup:

import requests
from bs4 import BeautifulSoup
url = "https://python123.io/ws/demo.html"
kv = {'user-agent':'Mozilla/5.0'}

r = requests.get(url,headers = kv)
r.raise_for_status()
r.encoding = r.apparent_encoding
demo = r.text
    
# parse with Beautiful Soup
soup = BeautifulSoup(demo,"html.parser")

1.4 Downward traversal

print("1:",soup.head)
print("2:",soup.head.contents)
print("3:",soup.body.contents)
print("4:",soup.body.contents[1])

Output:

1: <head><title>This is a python demo page</title></head>
2: [<title>This is a python demo page</title>]
3: ['\n', <p class="title"><b>The demo python introduces several python courses.</b></p>, '\n', <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>, '\n']
4: <p class="title"><b>The demo python introduces several python courses.</b></p>
  • Iterating over direct children
for child in soup.body.children:
    print(child)
    print("---")

Output:

<p class="title"><b>The demo python introduces several python courses.</b></p>


<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>

  • Iterating over all descendants

Each node is yielded in turn; for the first <p><b>×××</b></p>: (1) the whole p once, (2) the b tag once, (3) the string ××× once.

for child in soup.body.descendants:
    print(child)
    print("---")

Output:

---
<p class="title"><b>The demo python introduces several python courses.</b></p>
---
<b>The demo python introduces several python courses.</b>
---
The demo python introduces several python courses.
---


---
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
---
Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

---
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
---
Basic Python
---
 and 
---
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
---
Advanced Python
---
.
---


---

1.5 Upward traversal

for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)

Output:

p
body
html
[document]

1.6 Sibling traversal

print("1:",soup.a.next_sibling)
print("2:",soup.a.next_sibling.next_sibling)
print("3:",soup.a.previous_sibling)
print("4:",soup.a.previous_sibling.previous_sibling)
print("5:",soup.a.parent)

Output:

  1. " and " is the next sibling of the first a tag
  2. "Python is a wonder…" is its previous sibling
1:  and 
2: <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>
3: Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:

4: None
5: <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
  • Iterating over all following and all preceding siblings (note the plural .next_siblings/.previous_siblings; the singular forms used above return a single node, so looping over one would iterate over a string's characters instead)
for sibling in soup.a.next_siblings:
    print(sibling)
for sibling in soup.a.previous_siblings:
    print(sibling)

1.7 Formatted output with bs4

print("1:",soup.prettify())
print("2:",soup.a.prettify())

2. Organizing and extracting information

2.1 The three ways of marking up information

XML, JSON, and YAML; each has its pros and cons.

2.2 [Important] Searching HTML content with the bs4 library

<>.find_all(name,attrs,recursive,string,**kwargs)
  1. Returns a list holding the matches.
  2. name: string matched against tag names.
  3. attrs: string matched against attribute values; a specific attribute can be named.
  4. recursive: whether to search all descendants; defaults to True.
  5. string: string matched against the text between <>…</>.
print("1:",soup.find_all('a'))
print("2:",soup.find_all(['a','b']))

Result:

1: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
2: [<b>The demo python introduces several python courses.</b>, <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]

All tags can be retrieved with soup.find_all(True).
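
The recursive and string parameters from the list above can be seen on an inline snippet (a minimal sketch; the HTML here is made up):

```python
import re
from bs4 import BeautifulSoup

html = "<div><p>python basics</p><span><p>advanced python</p></span></div>"
soup = BeautifulSoup(html, "html.parser")

# recursive=True (the default) searches all descendants
print(len(soup.div.find_all("p")))                   # -> 2
# recursive=False searches direct children only
print(len(soup.div.find_all("p", recursive=False)))  # -> 1
# string= matches text nodes, optionally with a regex
print(soup.find_all(string=re.compile("python")))    # -> the two matching text nodes
```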

  • Using a regular expression:
import re

for tag in soup.find_all(re.compile("b")):
    print(tag.name)

This finds every tag whose name contains "b" (the pattern is applied with re.search, so not only names that start with b):

body
b
  • p tags whose class attribute contains "course":
soup.find_all("p","course")
[<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
 <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
  • Searching by id:
import re
print("1:",soup.find_all(id='link1'))
print("2:",soup.find_all(id='link2'))
print("\n")
print("3:",soup.find_all(id=re.compile("link")))

Output:

1: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
2: [<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]


3: [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
  • Text strings containing a given substring:
print("1:",soup.find_all(string = re.compile("python")))

Output:

1: ['This is a python demo page', 'The demo python introduces several python courses.']

2.3 Extension methods

Note the two shorthand forms:

  1. <tag>(..) is equivalent to <tag>.find_all(..)
  2. soup(..) is equivalent to soup.find_all(..)
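
The equivalence can be checked directly (a small sketch on made-up HTML):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a id="x">one</a><a id="y">two</a>', "html.parser")

# calling the soup (or any tag) is shorthand for .find_all
print(soup("a") == soup.find_all("a"))   # -> True
print(len(soup("a")))                    # -> 2
```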

[image: the find() family of extension methods]

3. Example: scraping the Chinese university rankings

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url,timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("NG")
    
    return ""


def fillUnivList(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    for tr in soup.find("tbody").children:
        if isinstance(tr,bs4.element.Tag):
            tds = tr("td")
            ulist.append([tds[0].string,tds[1].string,tds[2].string])
    pass


def printUnivList(ulist,num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名","学校","省市"))  # note: tds[2] on this page is the province, not the total score
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0],u[1],u[2]))
    
    print("Suc" + str(num))

def main():
    uinfo = []
    url = "http://www.zuihaodaxue.com/zuihaodaxuepaiming2019.html"
    html = getHTMLText(url)
    fillUnivList(uinfo,html)
    printUnivList(uinfo,20) # 20 univs
main()

Output:

    排名    	  学校  	    省市    
    1     	 清华大学 	    北京    
    2     	 北京大学 	    北京    
    3     	 浙江大学 	    浙江    
    4     	上海交通大学	    上海    
    5     	 复旦大学 	    上海    
    6     	中国科学技术大学	    安徽    
    7     	华中科技大学	    湖北    
    7     	 南京大学 	    江苏    
    9     	 中山大学 	    广东    
    10    	哈尔滨工业大学	   黑龙江    
    11    	北京航空航天大学	    北京    
    12    	 武汉大学 	    湖北    
    13    	 同济大学 	    上海    
    14    	西安交通大学	    陕西    
    15    	 四川大学 	    四川    
    16    	北京理工大学	    北京    
    17    	 东南大学 	    江苏    
    18    	 南开大学 	    天津    
    19    	 天津大学 	    天津    
    20    	华南理工大学	    广东    
Suc20
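
Beyond printing, the scraped rows could be kept for later use; a minimal sketch of a CSV writer companion to printUnivList (saveUnivList and the English column names are my own additions, not part of the course; csv and io are from the standard library):

```python
import csv
import io

def saveUnivList(ulist, fobj):
    # write rank / university / province rows as CSV
    writer = csv.writer(fobj)
    writer.writerow(["rank", "university", "province"])
    writer.writerows(ulist)

# demo with two rows from the output above, written to an in-memory buffer
buf = io.StringIO()
saveUnivList([["1", "清华大学", "北京"], ["2", "北京大学", "北京"]], buf)
print(buf.getvalue())
```

In the real script you would pass an open file instead of the StringIO buffer.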

4. Extracting Q&A from specified websites

import requests
from bs4 import BeautifulSoup
import bs4
import re

def getHTMLText(url):
    try:
        r = requests.get(url,timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("NG")
    
    return ""

def fillQAList(ulist,html):
    soup = BeautifulSoup(html,"html.parser")
    
    print("-----------------------------------------Questions-------------------------------------------")
    print("\n\n")
    
    for question in soup.find_all(class_="m-hdgLv3__hdg"):
        if(question.text.startswith("問")):
            print(question.text)
    
    print("\n\n")
    print("------------------------------------------Questions and answers--------------------------------------")
    
    for question in soup.find_all(class_="m-grid__col1"):
        if "ページの先頭へ戻る" in question.text:  # str.find returns -1 (truthy) when absent, so test membership instead
            for answer in question.text.split("ページの先頭へ戻る")[:-1]:
                print(answer.rstrip('\n'))

def main():
    uinfo = []
    url1 = {"name" : "よくあるお問い合わせをまとめました(FAQ)(2月21日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/newpage_00017.html"}
    url2 = {"name" : "一般の方向けQ&A(4月8日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/dengue_fever_qa_00001.html"}
    url3 = {"name" : "医療機関・検査機関向けQ&A(4月7日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/dengue_fever_qa_00004.html"}
    url4 = {"name" : "企業(労務)の方向けQ&A(4月6日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/dengue_fever_qa_00007.html"}
    url5 = {"name" : "労働者の方向けQ&A(3月25日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/dengue_fever_qa_00018.html"}
    url6 = {"name" : "関連業種の方向けQ&A(4月2日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/covid19_qa_kanrenkigyou.html"}
    url7 = {"name" : "水際対策の抜本的強化に関するQ&A(4月2日版)","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/covid19_qa_kanrenkigyou_00001.html"}
    url8 = {"name" : "学校再開に関するQ&A(子供たち、保護者、一般の方へ)","url" : "https://www.mext.go.jp/a_menu/coronavirus/mext_00003.html"}
    url9 = {"name" : "布マスクの全戸配布に関するQ&A","url" : "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/cloth_mask_qa_.html"}

    urllist = [url1,url2,url3,url4,url5,url6,url7,url8,url9]
    
    for url in urllist:
        print("\n\n\n")
        print(url["name"])
        html = getHTMLText(url["url"])
        fillQAList(uinfo,html)
    
main()

5. Supplement: crawlers and web-page structure

Reference: 莫烦Python

A very brief introduction; skipped!

6. Supplement: parsing pages with BeautifulSoup

The crawling workflow:

  1. Choose the URL to crawl
  2. Open that URL from Python (urlopen and the like)
  3. Read the page content (read() it out)
  4. Feed the content to BeautifulSoup
  5. Use BeautifulSoup to select tags and other information (instead of regular expressions)

An example:

import requests
from bs4 import BeautifulSoup

# 1. choose the URL to crawl
url = "https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/cloth_mask_qa_.html"

# 2. open the URL from Python
# 3. read the page content
r = requests.get(url)
r.encoding = r.apparent_encoding

# 4. feed the content to BeautifulSoup
soup = BeautifulSoup(r.text,"html.parser")

# 5. use BeautifulSoup to select tags (instead of regular expressions)
print("\n3:",soup.find_all('a'))
link3 = soup.find(id="link3")
if link3 is not None:  # the element may be absent on this page; guard against None
    print("\n4:",link3['href'])

6.1 Basics

import requests
from bs4 import BeautifulSoup
url = "https://morvanzhou.github.io/static/scraping/basic-structure.html"
r = requests.get(url)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text,"html.parser")

all_href = soup.find_all('a')
print("1,",all_href)

all_href = [l['href'] for l in all_href]
print("2,",all_href)

Output:

1, [<a href="https://morvanzhou.github.io/">莫烦Python</a>, <a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>]
2, ['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']

As shown, output 1 is the complete a tags, while output 2 uses l['href'] to pull out just the href values.
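
One caveat: l['href'] raises KeyError for an <a> tag that has no href attribute; .get() returns None instead, so such tags can be filtered out (a small sketch on made-up HTML; the URL is just a placeholder):

```python
from bs4 import BeautifulSoup

# the second <a> has no href, so subscripting it would raise KeyError
html = '<a href="https://example.com/">ok</a><a name="anchor">no href</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(hrefs)   # -> ['https://example.com/']
```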

6.2 CSS

Here the CSS class is used to extract content. First look at the structure of the example page, a very simple one:

<html lang="cn"><head>
	<meta charset="UTF-8">
	<title>爬虫练习 列表 class | 莫烦 Python</title>
	<style>
	.jan {
		background-color: yellow;
	}
	.feb {
		font-size: 25px;
	}
	.month {
		color: red;
	}
	</style>
</head>

<body>

<h1>列表 爬虫练习</h1>

<p>这是一个在 <a href="https://morvanzhou.github.io/">莫烦 Python</a><a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>
	里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>

<ul>
	<li class="month">一月</li>
	<ul class="jan">
		<li>一月一号</li>
		<li>一月二号</li>
		<li>一月三号</li>
	</ul>
	<li class="feb month">二月</li>
	<li class="month">三月</li>
	<li class="month">四月</li>
	<li class="month">五月</li>
</ul>

</body></html>

Example:

import requests
from bs4 import BeautifulSoup
url = "https://morvanzhou.github.io/static/scraping/list.html"
r = requests.get(url)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text,"html.parser")

month = soup.find_all('li',{'class':'month'})

for m in month:
    print("1,",m)
    print('----')
    print('2,',m.string)
    
print('-----------------')

jan = soup.find('ul',{'class':'jan'})

d_jan = jan.find_all('li')

for d in d_jan:
    print("3,",d.string)

Output:

1, <li class="month">一月</li>
----
2, 一月
1, <li class="feb month">二月</li>
----
2, 二月
1, <li class="month">三月</li>
----
2, 三月
1, <li class="month">四月</li>
----
2, 四月
1, <li class="month">五月</li>
----
2, 五月
-----------------
3, 一月一号
3, 一月二号
3, 一月三号
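
The same class-based lookup can also be written as a CSS selector via select(), which bs4 supports through the bundled soupsieve package (a sketch on a trimmed-down copy of the page above):

```python
from bs4 import BeautifulSoup

html = '<ul><li class="month">一月</li><li class="feb month">二月</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# "li.month" matches any li whose class list contains "month"
print([li.string for li in soup.select("li.month")])   # -> ['一月', '二月']
```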

6.3 Regular expressions

The test page's HTML is fairly simple; the body holds a table containing some image links:

<html lang="cn"><head>
	<meta charset="UTF-8">
	<title>爬虫练习 表格 table | 莫烦 Python</title>

	<style>
	img {
		width: 250px;
	}
	table{
		width:50%;
	}
	td{
		margin:10px;
		padding:15px;
	}
	</style>
</head>
<body>

<h1>表格 爬虫练习</h1>

<p>这是一个在 <a href="https://morvanzhou.github.io/">莫烦 Python</a><a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>
	里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>

<br>
<table id="course-list">
	<tbody><tr>
		<th>
			分类
		</th><th>
			名字
		</th><th>
			时长
		</th><th>
			预览
		</th>
	</tr>

	<tr id="course1" class="ml">
		<td>
			机器学习
		</td><td>
			<a href="https://morvanzhou.github.io/tutorials/machine-learning/tensorflow/">
				Tensorflow 神经网络</a>
		</td><td>
			2:00
		</td><td>
			<img src="https://morvanzhou.github.io/static/img/course_cover/tf.jpg">
		</td>
	</tr>

	<tr id="course2" class="ml">
		<td>
			机器学习
		</td><td>
			<a href="https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/">
				强化学习</a>
		</td><td>
			5:00
		</td><td>
			<img src="https://morvanzhou.github.io/static/img/course_cover/rl.jpg">
		</td>
	</tr>

	<tr id="course3" class="data">
		<td>
			数据处理
		</td><td>
			<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">
				爬虫</a>
		</td><td>
			3:00
		</td><td>
			<img src="https://morvanzhou.github.io/static/img/course_cover/scraping.jpg">
		</td>
	</tr>

</tbody></table>


</body></html>
Example:

import requests
import re
from bs4 import BeautifulSoup
url = "https://morvanzhou.github.io/static/scraping/table.html"
r = requests.get(url)
r.encoding = r.apparent_encoding
soup = BeautifulSoup(r.text,"html.parser")

img_links = soup.find_all('img',{'src':re.compile(r'.*?\.jpg')})  # raw string avoids the invalid-escape warning

for link in img_links:
    print("1,",link['src'])

This outputs:

1, https://morvanzhou.github.io/static/img/course_cover/tf.jpg
1, https://morvanzhou.github.io/static/img/course_cover/rl.jpg
1, https://morvanzhou.github.io/static/img/course_cover/scraping.jpg
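
The matched src values could then be saved to disk, which previews section 7.2 below; a hedged sketch (filename_from_url and download are my own helper names, not the tutorial's, and the network call is defined but not executed here):

```python
import os
from urllib.parse import urlsplit

def filename_from_url(url):
    # derive a local file name from the URL's last path segment
    return os.path.basename(urlsplit(url).path)

def download(url, folder="."):
    # hypothetical helper (not run here): stream the image to disk
    import requests
    r = requests.get(url, stream=True, timeout=30)
    r.raise_for_status()
    path = os.path.join(folder, filename_from_url(url))
    with open(path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    return path

print(filename_from_url("https://morvanzhou.github.io/static/img/course_cover/tf.jpg"))  # -> tf.jpg
```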

6.4 Exercise: scraping Baidu Baike

Skipped for now; it does not walk through a specific page structure, so it is of limited reference value.

7. Supplement: more ways to request and download

7.1 The versatile Requests

7.2 Downloading files

7.3 Downloading images

8. Supplement: speeding up your crawler

8.1 Speeding up the crawler: multiprocessing and distribution
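
The idea behind this heading is to fan per-page work out to worker processes with multiprocessing.Pool; a minimal sketch (parse() is a stand-in for a real scrape/parse job, not the tutorial's code):

```python
from multiprocessing import Pool

def parse(page):
    # stand-in for a per-page parsing job; must live at module
    # top level so it can be pickled for the worker processes
    return len(page)

if __name__ == "__main__":
    pages = ["<html>a</html>", "<html>bb</html>", "<html>ccc</html>"]
    # two worker processes map parse over the pages in parallel
    with Pool(2) as pool:
        sizes = pool.map(parse, pages)
    print(sizes)   # -> [14, 15, 16]
```

The __main__ guard is required on platforms that start workers with spawn (Windows, recent macOS).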

8.2 Asynchronous loading with asyncio
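
A minimal sketch of the asyncio pattern this heading refers to, with asyncio.sleep standing in for network I/O (real crawls would pair this with an async HTTP client):

```python
import asyncio

async def fetch(i):
    # stand-in for an async page download
    await asyncio.sleep(0.01)
    return i * 2

async def main():
    # run all fetches concurrently; gather preserves input order
    return await asyncio.gather(*(fetch(i) for i in range(5)))

results = asyncio.run(main())
print(results)   # -> [0, 2, 4, 6, 8]
```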

9. Supplement: advanced crawling

9.1 Driving the browser with Selenium

9.2 The efficient, worry-free Scrapy library