FishC Forum


How can a web scraper avoid duplicate output?

Posted on 2018-4-15 14:32:18

>>> from bs4 import BeautifulSoup as BS
>>> html_doc = """<html><head></head>
<body>
<p class="title"><b>An Interesting Story</b></p>
<p class="story">Long long ago, there lives an interesting man. He lives with
<a class="father" id="link1">father</a>
<a class="mother" id="link2">mother</a>
<a class="brother" id="link3">brother</a>and
<a class="sister" id="link4">sister</a>.
They live in harmony...</p>
<p class="story">...</p>
</body>
"""
>>> soup = BS(html_doc, 'lxml')

>>> tag = soup.p.b

>>> for parent in tag.parents:
        print(parent)


<p class="title"><b>An Interesting Story</b></p>
<body>
<p class="title"><b>An Interesting Story</b></p>
<p class="story">Long long ago, there lives an interesting man. He lives with
<a class="father" id="link1">father</a>
<a class="mother" id="link2">mother</a>
<a class="brother" id="link3">brother</a>and
<a class="sister" id="link4">sister</a>.
They live in harmony...</p>
<p class="story">...</p>
</body>
<html><head></head>
<body>
<p class="title"><b>An Interesting Story</b></p>
<p class="story">Long long ago, there lives an interesting man. He lives with
<a class="father" id="link1">father</a>
<a class="mother" id="link2">mother</a>
<a class="brother" id="link3">brother</a>and
<a class="sister" id="link4">sister</a>.
They live in harmony...</p>
<p class="story">...</p>
</body>
</html>
<html><head></head>
<body>
<p class="title"><b>An Interesting Story</b></p>
<p class="story">Long long ago, there lives an interesting man. He lives with
<a class="father" id="link1">father</a>
<a class="mother" id="link2">mother</a>
<a class="brother" id="link3">brother</a>and
<a class="sister" id="link4">sister</a>.
They live in harmony...</p>
<p class="story">...</p>
</body>
</html>
Xiaojiayu's latest courses -> https://ilovefishc.com
Original poster | Posted on 2018-4-15 14:34:34
The output seems to be duplicated. How can I fix this?
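Posted on 2018-4-15 15:00:06
tag.parents walks up through every ancestor of the current tag: first the enclosing <p>, then <body>, then <html>, and finally the BeautifulSoup document object itself. Each ancestor is printed together with everything it contains, so the output is nested rather than duplicated. The last two blocks look identical because <html> and the document object print the same markup.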
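To make the ancestor chain easier to read, one option is to print only each ancestor's tag name instead of its full markup. The snippet below is a minimal sketch of that idea, not necessarily what the original poster needs. It assumes a shortened version of the sample document and uses the built-in 'html.parser' instead of 'lxml' so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document from the question (an assumption
# made to keep the example small).
html_doc = """<html><head></head>
<body>
<p class="title"><b>An Interesting Story</b></p>
<p class="story">...</p>
</body>
</html>"""

# 'html.parser' ships with Python; the original post used 'lxml'.
soup = BeautifulSoup(html_doc, "html.parser")
tag = soup.p.b

# Every ancestor contains all of its descendants, so collect just the names.
ancestor_names = [parent.name for parent in tag.parents]
print(ancestor_names)
```

The final entry in the list is the BeautifulSoup object itself, whose name is '[document]'. That is why the full-markup version prints the whole page twice at the end.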
Posted on 2018-4-15 18:14:59
Use a set to deduplicate the data.
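The set suggestion applies when the scraped values themselves repeat, for example the same link collected from several pages. Here is a rough sketch, assuming the values are hashable (such as strings) and that first-seen order should be kept. The helper name dedupe and the sample list are hypothetical, chosen only for illustration.

```python
def dedupe(items):
    """Return the items with duplicates removed, keeping first-seen order."""
    seen = set()      # values encountered so far
    unique = []       # result list, in original order
    for item in items:
        if item not in seen:
            seen.add(item)
            unique.append(item)
    return unique


# Hypothetical scraped values containing repeats.
scraped_ids = ["link1", "link2", "link1", "link3", "link2"]
unique_ids = dedupe(scraped_ids)
print(unique_ids)
```

Calling set(scraped_ids) directly would also drop the repeats, but it does not preserve the order in which the values were scraped.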