The site's HTML is structured like this:
<div id="site-list-content">
  <div>
    <div>
      <a>
        <div>
          This is the title I need
        </div>
      </a>
      <div>
        This is the content I need
        <span>
          Extra text that I don't need
        </span>
      </div>
    </div>
  </div>
</div>
This is the code I have so far:
sel = scrapy.selector.Selector(response)
sites = sel.xpath('//*[@id="site-list-content"]/div/div')
for site in sites:
    title = site.xpath('a/div/text()').extract()
    desc = site.xpath('div/text()').extract()
    print(title, desc)
It prints the following:
['Data Structures and Algorithms with Object-Oriented Design Patterns in Python '] ['\r\n\t\t\t\r\n The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\r\n ', '\r\n ']
[] []
[] []
['Dive Into Python 3 '] ['\r\n\t\t\t\r\n By Mark Pilgrim, Guide to Python 3 and its differences from Python 2. Each chapter starts with a real code sample and explains it fully. Has a comprehensive appendix of all the syntactic and semantic changes in Python 3\r\n\r\n\r\n ', '\r\n ']
[] []
[] []
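As you can see, every real result is followed by two pairs of empty lists. Could anyone tell me how to get rid of the empty entries, and how to exclude the span inside the second div so that its text is not returned?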
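A note on both parts of the question, with a small sketch. `div/text()` only selects text nodes that are direct children of the div, so the span's own text is already left out; the second list element (`'\r\n '`) is the whitespace that follows the closing `</span>`. The `[] []` rows presumably come from inner divs that have no `a` child, so adding a predicate such as `div[a]` to the container XPath should skip them (in Scrapy that might look like `sel.xpath('//*[@id="site-list-content"]/div/div[a]')`, untested against the real page). The sketch below uses the standard library's `xml.etree.ElementTree` as a stand-in so that it runs without Scrapy; `SAMPLE` and `extract_sites` are made-up names, and the markup is a hypothetical, well-formed imitation of the page.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the structure in the question. The empty
# inner <div> stands in for the entries that printed "[] []".
SAMPLE = """
<body>
  <div id="site-list-content">
    <div>
      <div>
        <a><div>Dive Into Python 3 </div></a>
        <div>
          By Mark Pilgrim, Guide to Python 3.
          <span>extra text that is not wanted</span>
        </div>
      </div>
      <div></div>
    </div>
  </div>
</body>
"""


def extract_sites(markup):
    """Return one dict per entry, skipping containers without an <a> child."""
    root = ET.fromstring(markup)
    items = []
    # The "[a]" predicate keeps only inner <div> elements that have an <a> child.
    for site in root.findall(".//*[@id='site-list-content']/div/div[a]"):
        title = site.findtext("a/div", default="")
        desc_el = site.find("div")
        # .text is only the text before the first child element, so the
        # <span> content and the whitespace after it are not included.
        desc = desc_el.text if desc_el is not None and desc_el.text else ""
        items.append({
            "title": " ".join(title.split()),
            "desc": " ".join(desc.split()),
        })
    return items


if __name__ == "__main__":
    for item in extract_sites(SAMPLE):
        print(item)
```

ElementTree needs well-formed markup, so for a real page lxml or Scrapy's selectors are the practical choice; the predicate idea carries over unchanged.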
All right, here you go. I also tidied up the output a bit for you.
import requests
from lxml import etree

url = "http://www.dmoztools.net/Computers/Programming/Languages/Python/Books/"
headers = {"user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3947.100 Firefox/73.0 Safari/537.36"}
response = requests.get(url, headers=headers)
tree = etree.HTML(response.text)
data = tree.xpath("//div[@id='site-list-content']/div")
lst = []
# XPath positions are 1-based, so count from 1 to len(data).
for x in range(1, len(data) + 1):
    ddict = {}
    title = tree.xpath("//section[@class='results sites']/div[1]/div[1]/div[%d]/div[3]/a/div[1]/text()" % x)
    content = tree.xpath("//section[@class='results sites']/div[1]/div[1]/div[%d]/div[3]/div[1]/text()" % x)
    # Skip containers without an entry; title[0] would raise IndexError.
    if not title or not content:
        continue
    ddict["title"] = title[0]
    # split() with no arguments already drops \r, \n and \t, and keeps the
    # words on either side of a line break separated by a space.
    ddict["content"] = " ".join(content[0].split())
    lst.append(ddict)
for x in lst:
    print(x)
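If you would rather keep the original Scrapy loop, the same cleanup can be applied to the lists that `extract()` returns. This is only a sketch: `clean_text` and `clean_results` are made-up helper names, and `raw` is a shortened copy of the output shown in the question, not fresh data from the site.

```python
def clean_text(parts):
    """Join extracted text nodes and collapse each whitespace run to one space."""
    return " ".join(" ".join(parts).split())


def clean_results(pairs):
    """Turn (title_parts, desc_parts) pairs into dicts, dropping empty pairs."""
    cleaned = []
    for title_parts, desc_parts in pairs:
        title = clean_text(title_parts)
        desc = clean_text(desc_parts)
        # A pair such as ([], []) yields two empty strings and is skipped.
        if not title and not desc:
            continue
        cleaned.append({"title": title, "content": desc})
    return cleaned


# Shortened from the output in the question (assumption: same shape as the real data).
raw = [
    (["Dive Into Python 3 "],
     ["\r\n\t\t\t\r\n By Mark Pilgrim, Guide to Python 3\r\n\r\n ", "\r\n "]),
    ([], []),
    ([], []),
]

if __name__ == "__main__":
    for row in clean_results(raw):
        print(row)
```

Joining the parts before splitting means the stray whitespace-only node after the span disappears together with the line breaks.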