Advanced Python Web Scraping: BeautifulSoup [0]

MSK · Posted on 2017-7-9 13:16:43

Last edited by MSK on 2017-7-9 23:01

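Recommended reading: The BeautifulSoup object

Think regular expressions make writing a scraper too hard? Give BeautifulSoup a try!!!

Beautiful Soup is a Python library for pulling data out of HTML or XML files.
Before you start, you will need some basic HTML knowledge.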

1. Installation

pip install beautifulsoup4

P.S. The version we use is BeautifulSoup 4

I just like using pip

2. Importing

from bs4 import BeautifulSoup

3. A first look at BeautifulSoup

Here is a snippet of HTML that we will keep coming back to:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>
"""

Import the module and create the BeautifulSoup object:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
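A side note on the second argument: 'html.parser' is the parser that ships with Python, so it needs nothing extra. Below is a rough sketch of falling back to it when a faster third-party parser is not installed. The helper name make_soup and the choice of "lxml" as the preferred parser are my own assumptions, not something this post depends on.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: try the preferred parser, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>demo</title></head></html>")
print(soup.title.string)
```

Different parsers can build slightly different trees from broken HTML, so it is probably best to pick one and stick with it within a project.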

Get the <title> tag:

soup.title
#<title>The Dormouse's story</title>

Get the name of the <title> tag:

soup.title.name
#title

See, simpler than regular expressions, right?
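To make that comparison concrete, here is a small side-by-side sketch. The one-line page string is a made-up stand-in for a real page, and the regex is just one plausible pattern, not the only way to write it.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-line page, only for this comparison
page = "<html><head><title>The Dormouse's story</title></head><body></body></html>"

# Regex: we describe the markup ourselves, and the pattern stops matching
# as soon as the tag gains an attribute or changes case
match = re.search(r"<title>(.*?)</title>", page, re.S)
title_via_regex = match.group(1) if match else None

# BeautifulSoup: the parser handles the markup, we just ask for the tag
soup = BeautifulSoup(page, "html.parser")
title_via_soup = soup.title.string

print(title_via_regex)
print(title_via_soup)
```

Both lines print the same title here. The difference shows up on messier pages, where the regex needs constant patching and the soup version usually does not.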

Get the text of the <title> tag:

soup.title.string
#"The Dormouse's story"
soup.title.text
#"The Dormouse's story"
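For <title> the two give the same result, but they are not interchangeable. A quick sketch of where they differ, using a made-up snippet rather than the document above:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a <p> that mixes plain text with a nested <b> tag
snippet = '<p class="story">Hello <b>bold</b> world</p>'
soup = BeautifulSoup(snippet, "html.parser")
p = soup.p

# .string only works when the tag has exactly one child; otherwise it is None
print(p.string)        # None
print(p.b.string)      # bold

# .text (same as get_text()) joins every piece of text inside the tag
print(p.text)          # Hello bold world

# get_text can also strip each piece and join with a separator
print(p.get_text(" ", strip=True))
```

So if .string suddenly gives you None, the tag most likely has nested tags inside it, and .text or get_text() is the safer choice.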

Get a tag's class:

soup.p['class']
#['title']
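Notice the result is a list, not a string. That is because class can hold several values at once. A small sketch of reading attributes, on a made-up <a> tag with two classes:

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two class values, only for illustration
snippet = '<a href="http://example.com/elsie" class="sister big" id="link1">Elsie</a>'
soup = BeautifulSoup(snippet, "html.parser")
link = soup.a

classes = link["class"]        # multi-valued attribute: ['sister', 'big']
href = link["href"]            # ordinary attribute: a plain string
missing = link.get("title")    # .get() gives None instead of raising KeyError
attrs = link.attrs             # every attribute as a dict

print(classes, href, missing)
print(attrs["id"])
```

Using link["title"] on a tag without that attribute raises KeyError, so .get() is handy when you are not sure the attribute is there.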

Find all <a> tags:

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
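In a real scraper you rarely want the tags themselves, you want what is in them. A minimal sketch of looping over the result to collect link text and URLs. The document is re-declared in shortened form so the block runs on its own.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document, so this block is self-contained
html_doc = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list-like result, so a normal loop or comprehension works
links = [(a.get_text(), a.get("href")) for a in soup.find_all("a")]
for text, href in links:
    print(text, href)

# Filter by CSS class (note the trailing underscore: class is a Python keyword)
sisters = soup.find_all("a", class_="sister")

# Stop after the first two matches
first_two = soup.find_all("a", limit=2)
```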

Find the tag whose id is link3:

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
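One thing to watch out for: find returns None when nothing matches, and using None like a tag blows up. A small sketch of guarding against that. The helper href_by_id and the id "link99" are made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical one-link snippet, only for illustration
snippet = '<p><a href="http://example.com/tillie" class="sister" id="link3">Tillie</a></p>'
soup = BeautifulSoup(snippet, "html.parser")

found = soup.find(id="link3")
not_found = soup.find(id="link99")   # no such id, so this is None


def href_by_id(soup, element_id):
    """Hypothetical helper: return the href of the tag with this id, or None."""
    tag = soup.find(id=element_id)
    return tag.get("href") if tag is not None else None


print(href_by_id(soup, "link3"))
print(href_by_id(soup, "link99"))
```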

This post is just meant to give everyone a first taste of BeautifulSoup.

To be continued!!!
