import requests
from lxml import etree

def spider():
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"}
    for x in range(1, 2):
        url = f"http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{x}"
        response = requests.get(url, headers=headers)
        selector = etree.HTML(response.text)
        book_list = selector.xpath("//*[@class='bang_list clearfix bang_list_mode']/li")  # extract the whole entry for each book
        for book in book_list:
            book_name = book.xpath("//li/div[@class='name']/a/@title")  # extract the title of each book
            print(book_name)

spider()
I expected the output to look like this:
'蛤蟆先生去看心理医生(畅销100万册!英国经典心理咨询入门书,知名心理学家李松蔚强烈推荐)'
'文城(余华新书,时隔8年重磅归来,《活着》之后又一精彩力作)'
'少年读史记(套装全5册)'
... and so on
This post was last edited by suchocolate on 2021-6-8 22:51
import requests
from lxml import etree

def spider():
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36"}
    for x in range(1, 2):
        url = f"http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-{x}"
        r = requests.get(url, headers=headers)
        html = etree.HTML(r.text)
        # Take the title attribute directly. The page has a lot of extraneous li
        # elements, so selecting the li nodes first and then the title is actually slower.
        result = html.xpath('//li/div[@class="name"]/a/@title')
        for book in result:
            print(book)

if __name__ == "__main__":
    spider()
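For anyone who still wants the two-step approach (select each li, then read its title), here is a minimal sketch of why the original loop printed the full list on every iteration. In lxml, an expression that starts with `//` is evaluated from the document root even when it is called on a child element, so it has to be made relative with `./` (or `.//`). The `SAMPLE_HTML` string and both function names are made up for illustration; the markup is a simplified stand-in, not the real Dangdang page.

```python
from lxml import etree

# Hypothetical, simplified stand-in for the bestseller list markup.
SAMPLE_HTML = """
<html><body>
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a title="Book A">Book A</a></div></li>
  <li><div class="name"><a title="Book B">Book B</a></div></li>
  <li><div class="name"><a title="Book C">Book C</a></div></li>
</ul>
</body></html>
"""

LIST_XPATH = "//*[@class='bang_list clearfix bang_list_mode']/li"


def extract_titles(html_text):
    """Return one title per <li>, using an XPath relative to each <li>."""
    selector = etree.HTML(html_text)
    titles = []
    for book in selector.xpath(LIST_XPATH):
        # "./" starts at the current <li>; a leading "//" would restart
        # the search at the document root.
        names = book.xpath("./div[@class='name']/a/@title")
        if names:
            titles.append(names[0])
    return titles


def extract_titles_absolute(html_text):
    """Reproduce the original behaviour: every <li> yields the full title list."""
    selector = etree.HTML(html_text)
    return [
        book.xpath("//li/div[@class='name']/a/@title")
        for book in selector.xpath(LIST_XPATH)
    ]


if __name__ == "__main__":
    for title in extract_titles(SAMPLE_HTML):
        print(title)
```

With the relative path each book contributes exactly one title, which matches the expected output in the question. Whether it is faster or slower than the single `//li/div[@class="name"]/a/@title` query on the real page is something you would need to measure.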