[Solved] Scraping www.dmoztools.net returns lots of empty lists; I'm stuck and would appreciate any help

zzw10 · posted 2020-3-9 03:32:46


The site's HTML is structured like this:
<div id="site-list-content">
<div>
<div>
<a>
<div>
This holds the title I need
</div>
</a>
<div>
This holds the content I need
<span>
Miscellaneous stuff I don't need
</span>
</div>
</div>
</div>
</div>

And this is the code I have written so far:
      sel=scrapy.selector.Selector(response)
      sites=sel.xpath('//*[@id="site-list-content"]/div/div')
      for site in sites:
         title=site.xpath('a/div/text()').extract()
         desc=site.xpath('div/text()').extract()
         print(title,desc)

This prints:
['Data Structures and Algorithms with Object-Oriented Design Patterns in Python '] ['\r\n\t\t\t\r\n                                  The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n                                  ', '\r\n                               ']
[] []
[] []
['Dive Into Python 3 '] ['\r\n\t\t\t\r\n                                  By Mark Pilgrim, Guide to Python 3  and its differences from Python 2. Each chapter starts with a real code sample and explains it fully. Has a comprehensive appendix of all the syntactic and semantic changes in Python 3\r\n\r\n\r\n                                  ', '\r\n                               ']
[] []
[] []

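Every real result is followed by two pairs of empty lists. How can I get rid of the empty ones, and how can I exclude the span inside the second div so that its text is not returned?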
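One plausible explanation, assuming each entry wraps several sibling divs and only one of them holds the title link (the XPaths in the replies, which address div[1]/i and div[3]/a/div, suggest as much): the selector //*[@id="site-list-content"]/div/div matches every one of those siblings, and the ones without an a/div child yield []. The sketch below illustrates the idea on a made-up, simplified fragment. It uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and SAMPLE and extract_sites are hypothetical names, not part of the site or of Scrapy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one entry of the page (an assumption,
# not the real markup): three sibling divs, only the last holds title and text.
SAMPLE = """\
<div id="site-list-content">
  <div>
    <div><i class="icon"></i></div>
    <div>placeholder</div>
    <div>
      <a><div>Dive Into Python 3 </div></a>
      <div>
        By Mark Pilgrim, Guide to Python 3
        and its differences from Python 2.
        <span>unwanted extras</span>
      </div>
    </div>
  </div>
</div>
"""


def extract_sites(markup):
    """Return a list of {'title', 'content'} dicts, skipping nodes without a title."""
    root = ET.fromstring(markup)
    results = []
    # Same shape as the question's selector: every div two levels below the root.
    for site in root.findall("./div/div"):
        title_node = site.find("a/div")
        desc_node = site.find("div")
        if title_node is None or desc_node is None:
            continue  # a sibling div that carries no title or description
        # .text is only the text before the first child, so the span is excluded.
        title = " ".join((title_node.text or "").split())
        content = " ".join((desc_node.text or "").split())
        results.append({"title": title, "content": content})
    return results


sites = extract_sites(SAMPLE)
print(sites)
```

In the Scrapy loop from the question, the corresponding guard would be a check such as "if not title: continue" before the print. Note that div/text() already leaves out the text inside the span; the second item in each description list is just the whitespace that follows the span.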


wp231957 · posted 2020-3-9 08:32:00

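I didn't know which subcategory your data is in, so I just picked one at random.
I did confirm, though, that the situation you describe does not occur there.
I don't know how to use Scrapy, so I wrote a quick version with requests + lxml.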

import requests
from lxml import etree

url = "http://www.dmoztools.net/Business/Accounting/Business-to-Business/"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Firefox/73.0 Safari/537.36"}
response = requests.get(url, headers=headers)
tree = etree.HTML(response.text)
# class attribute of the <i> element in the first div of the first entry
data = tree.xpath("//div[@id='site-list-content']/div[1]/div[1]/i/@class")
print(data)
# direct text nodes of the first div inside the third div of the first entry
data = tree.xpath("//div[@id='site-list-content']/div[1]/div[3]/div[1]/text()")
print(data)


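zzw10 · posted 2020-3-9 11:01:54

wp231957 posted 2020-3-9 08:32
I didn't know which subcategory your data is in, so I just picked one at random.
I did confirm, though, that the situation you describe does not occur there.
As for Scrapy, I don't ...

That fetches a single entry, right? I can see the index after the div. I know how to get the title and content of one entry: with an index added, a single title and content come back without any []. But as soon as I select all the titles and contents, the empty [] show up.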

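wp231957 · posted 2020-3-9 11:03:28

zzw10 posted 2020-3-9 11:01
That fetches a single entry, right? I can see the index after the div. I know how to get the title and content of one entry: with an index added, a single title + ...

Highlight the elements you want in red, and post the URL as well.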

zzw10 · posted 2020-3-9 14:33:02

wp231957 posted 2020-3-9 11:03
Highlight the elements you want in red, and post the URL as well.

Here is the URL:
http://www.dmoztools.net/Compute ... uages/Python/Books/
What I want is the title and content of each entry under sites 18. For example, the first one is
Data Structures and Algorithms with Object-Oriented Design Patterns in Python
and
The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.
A secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.
I want all the titles and contents on this page, but I can't tell what is wrong with the XPath I wrote. Thanks for your help!

wp231957 · posted 2020-3-9 15:29:43

This answer by wp231957 was selected as the best answer.

zzw10 posted 2020-3-9 14:33
Here is the URL:
http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/
What I want is sites 18 ...

OK, here it is. I also tidied up the output a bit for you.

import requests
from lxml import etree

url = "http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Firefox/73.0 Safari/537.36"}
response = requests.get(url, headers=headers)
tree = etree.HTML(response.text)
# Count the divs under site-list-content, then address each entry by its 1-based XPath index.
data = tree.xpath("//div[@id='site-list-content']/div")
lst = []
for x in range(1, len(data) + 1):
    ddict = {}
    title = tree.xpath("//section[@class='results sites']/div[1]/div[1]/div[%d]/div[3]/a/div[1]/text()" % x)
    content = tree.xpath("//section[@class='results sites']/div[1]/div[1]/div[%d]/div[3]/div[1]/text()" % x)
    ddict["title"] = title[0]
    # Delete CR, LF and tab characters, then collapse runs of spaces into single spaces.
    ddict["content"] = " ".join(content[0].replace("\r", "").replace("\n", "").replace("\t", "").split())
    lst.append(ddict)
for x in lst:
    print(x)

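Two details of the cleanup step may be worth a closer look. The sketch below is only an illustration on strings shaped like the output shown in the question; clean_first is a hypothetical helper name, not something from the thread. It guards against an empty XPath result (where indexing with [0] would raise IndexError) and collapses whitespace with split() alone.

```python
def clean_first(texts):
    """Return the first string with whitespace runs collapsed, or None if the list is empty."""
    if not texts:
        return None  # the XPath matched nothing, so there is no [0] to read
    # str.split() with no arguments splits on any whitespace, including \r, \n and \t.
    return " ".join(texts[0].split())


# Shaped like the description list printed in the question.
raw = [
    "\r\n\t\t\t\r\n      By Mark Pilgrim, Guide to Python 3  and its differences from Python 2.\r\n\r\n   ",
    "\r\n   ",
]
print(clean_first(raw))
print(clean_first([]))

# Deleting line breaks before splitting glues neighbouring lines together.
two_lines = "design patterns.\r\nA secondary goal"
glued = " ".join(two_lines.replace("\r", "").replace("\n", "").split())
spaced = " ".join(two_lines.split())
print(glued)
print(spaced)
```

With split() alone, a space is kept where the line break used to be, which is probably what is wanted for descriptions that span two lines, such as the first book on the page.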

zzw10 · posted 2020-3-9 15:45:02

wp231957 posted 2020-3-9 15:29
OK, here it is. I also tidied up the output a bit for you.

Thank you very much!
