Advanced Python Web Scraping: BeautifulSoup Nodes, Python Discussion, Programming Languages, FishC Forum

MSK posted on 2017-7-10 11:18:01

Advanced Python Web Scraping: BeautifulSoup Nodes

Last edited by MSK on 2017-7-10 11:23

Nodes

This post is fairly long, so please bear with me.
Recommended reading: The BeautifulSoup object

-----------------------------------------------------------------------------------------------------------------------------------------------------------
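Child nodes

In an HTML document, a Tag may contain multiple strings or other Tags. All of these are children of that Tag.

In the snippet below, <title> is a child of <head>; conversely, <head> is the parent of <title>.
Note: string nodes in Beautiful Soup do not support these attributes, because a string has no children.

<head><title>Title</title></head>

Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

The simplest way is to use the name of the tag you want. To get the <head> tag, just use soup.head: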

soup.head
# <head><title>The Dormouse's story</title></head>

Get the <title> tag:
soup.title
# <title>The Dormouse's story</title>

You can also chain this attribute access several times:
soup.head.title
# <title>The Dormouse's story</title>

However!!!

Accessing a tag by attribute name only returns the FIRST tag with that name!
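To make this limitation concrete, here is a minimal sketch. The markup is a hypothetical three-link sample (not the full "Alice" document), and the html.parser backend is my own choice:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with three <a> tags.
html = (
    '<p><a id="link1">Elsie</a>'
    '<a id="link2">Lacie</a>'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

first = soup.a                  # attribute access: only the first <a>
print(first["id"])              # link1
print(len(soup.find_all("a")))  # 3, so two tags were not reachable via soup.a
```

soup.a behaves like soup.find("a"): it stops at the first match.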

find_all

To get every tag of a given kind, for example all <a> tags, you need a method such as find_all():

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
The return value is a list (a ResultSet, which is a subclass of list).
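Because the result behaves like a list, you can loop over it or index into it. A small sketch, assuming a trimmed-down copy of the three sister links and the html.parser backend:

```python
from bs4 import BeautifulSoup

# Trimmed-down sample: just the three sister links.
html = """
<p>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(isinstance(links, list))      # True

# Collect the href attribute of every <a> tag found.
hrefs = [a["href"] for a in links]
for href in hrefs:
    print(href)

print(links[1].string)              # Lacie
```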

Inspecting child nodes

.contents and .children

A tag's .contents attribute returns its children as a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# ["The Dormouse's story"]

The BeautifulSoup object itself always has children. In other words, the <html> tag is a child of the BeautifulSoup object:

**** Hidden Message *****
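A minimal sketch of this point, under my own assumptions (a tiny one-line document and the html.parser backend). It also shows a caveat: with this parser, whitespace before <html> becomes a string node of its own, so the first child is not always the <html> tag.

```python
from bs4 import BeautifulSoup

doc = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(doc, "html.parser")

print(len(soup.contents))      # 1
print(soup.contents[0].name)   # html

# Same document, but the markup starts with a newline.
soup2 = BeautifulSoup("\n" + doc, "html.parser")
print(len(soup2.contents))                 # 2
print(type(soup2.contents[0]).__name__)    # NavigableString
print(soup2.contents[1].name)              # html
```

If you need the <html> tag regardless of leading whitespace, soup.html is the more robust way to get it.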

Strings have no .contents attribute, because a string has no children:

text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
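When you walk a tree you often do not know in advance whether a node is a tag or a string. One possible way to avoid the AttributeError is to check the node type first. This is only a sketch; the helper name child_count is hypothetical:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
title_tag = soup.title
text = title_tag.contents[0]

print(isinstance(text, NavigableString))  # True
print(hasattr(text, "contents"))          # False


def child_count(node):
    """Hypothetical helper: number of direct children, 0 for strings."""
    if isinstance(node, Tag):
        return len(node.contents)
    return 0


print(child_count(title_tag))  # 1
print(child_count(text))       # 0
```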

**** Hidden Message *****

With a tag's .children iterator, you can loop over the tag's children:

for child in title_tag.children:
    print(child)
# The Dormouse's story
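The difference between the two attributes is mostly about how you consume them: .contents is a real list, while .children is meant for iteration. A short sketch with a hypothetical <ul> sample, parsed with html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a list with two items.
soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
ul = soup.ul

# .contents is a list: it supports len() and indexing.
print(len(ul.contents))   # 2
print(ul.contents[1])     # <li>two</li>

# .children yields the same nodes, one at a time.
texts = [child.string for child in ul.children]
print(texts)              # ['one', 'two']
```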

-----------------------------------------------------------------------------------------------------------------------------------------------------------

Descendant nodes

Straight to the code:

head_tag.contents
# [<title>The Dormouse's story</title>]

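The .contents and .children attributes only include a tag's direct children. For example, the <head> tag has a single direct child, <title>.
But the <title> tag has a child of its own: the string "The Dormouse's story". In that case the string "The Dormouse's story" is also a descendant of the <head> tag.

Children != descendants
A son != a son's son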

.descendants

The .descendants attribute lets you iterate recursively over all of a tag's descendants:

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
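Counting both collections side by side makes the difference visible. A minimal sketch, assuming only the <head> fragment of the example document and the html.parser backend:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

children = list(head_tag.children)        # direct children only
descendants = list(head_tag.descendants)  # children, grandchildren, ...

print(len(children))      # 1: the <title> tag
print(len(descendants))   # 2: the <title> tag and the string inside it
```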

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Parent node

.parent

Use the .parent attribute to get an element's parent. In the "Alice" example document, the <head> tag is the parent of the <title> tag:

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The title string itself also has a parent, the <title> tag:

title_tag.string.parent
# <title>The Dormouse's story</title>

The parent of a top-level node such as <html> is the BeautifulSoup object:

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

Naturally, the BeautifulSoup object has no parent, so its .parent is None:

print(soup.parent)
# None
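Putting these facts together, you can climb from any node to the top of the document by following .parent until it is None. A sketch under my own assumptions (a one-line document, html.parser):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# Start at the title string and climb until there is no parent left.
node = soup.title.string
chain = []
while node.parent is not None:
    node = node.parent
    chain.append(node.name)

print(chain)          # ['title', 'head', 'html', '[document]']
print(node is soup)   # True: the climb ends at the BeautifulSoup object
```

The name '[document]' is the one Beautiful Soup gives to the BeautifulSoup object itself.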

-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sibling nodes

Children of the same element are called siblings.

For example:

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>")
print(sibling_soup.prettify())
# <html>
#  <body>
#   <a>
#    <b>
#     text1
#    </b>
#    <c>
#     text2
#    </c>
#   </a>
#  </body>
# </html>

Because the <b> tag and the <c> tag are at the same level, both being children of <a>, <b> and <c> can be called siblings.

.next_sibling and .previous_sibling

Use the .next_sibling and .previous_sibling attributes to move between siblings:

**** Hidden Message *****

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>
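Here is a runnable version of the sibling lookups. It is a sketch that assumes the markup <a><b>text1</b><c>text2</c></a> and the html.parser backend (so no <html>/<body> wrapper is added):

```python
from bs4 import BeautifulSoup

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", "html.parser")
b_tag = sibling_soup.b
c_tag = sibling_soup.c

print(b_tag.next_sibling)       # <c>text2</c>
print(c_tag.previous_sibling)   # <b>text1</b>

# <b> is the first child and <c> is the last one, so these are None.
print(b_tag.previous_sibling)   # None
print(c_tag.next_sibling)       # None

# The strings text1 and text2 have different parents, so they are
# not siblings of each other.
print(b_tag.string.next_sibling)  # None
```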

However...
**** Hidden Message *****

tuxiaoqing posted on 2017-9-27 12:08:00

Thanks to the OP for sharing

大头目 posted on 2018-2-22 22:50:18

Learning

谁与争锋 posted on 2018-3-1 23:10:21

Learning

775155480 posted on 2018-3-6 14:28:01

Learning

大头目 posted on 2018-3-8 20:58:12

len(soup.contents)
# 1
soup.contents[0].name
# u'html'
Why is my result different from what the explanation says?
from bs4 import BeautifulSoup

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
The Dormouse's story

Once upon a time there were three little sisters; and their names were
<a class="sister" id="link1">Elsie</a>,
<a class="sister" id="link2">Lacie</a> and
<a class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.'''

soup = BeautifulSoup(html_doc,'html.parser')
print(8,len(soup.contents))
print(9,soup.contents[0].name)

新少丶 posted on 2018-3-11 17:11:30

Here to learn

智能的板砖 posted on 2018-8-25 15:52:21

Learning a bit

塔利班 posted on 2018-9-18 15:47:05

快给我点吃的 posted on 2018-9-20 14:28:52

Well written and very practical

沉迷include posted on 2018-11-5 22:53:16

Could someone explain previous_sibling in detail?

2011gg posted on 2018-11-10 15:59:13

hujh posted on 2019-11-8 20:19:31

Cheers to the OP

yizhaosheng posted on 2019-11-9 13:08:03

Learning

cyd55199226 posted on 2020-1-6 16:16:08

666

chinamafia posted on 2020-3-11 15:14:35

Learning

kkk恪 posted on 2020-3-13 15:44:51

Learned something, thanks

liugang8332 posted on 2020-3-22 01:05:23

I'd like to see the rest below, thanks OP

dululu posted on 2020-4-3 14:59:29

Thanks to the OP for the explanation

麻麦皮 posted on 2020-4-18 23:57:51

Dropping in to learn
