Last edited by payton24 on 2018-2-3 18:01
Douban has plenty of data, so let's keep scraping it. In the previous post, copying the XPath straight from the browser worked well, so we'll use the same approach here.
This time the target is Douban's Top 250 books.
URL to scrape: https://book.douban.com/top250
Fields to scrape: book title, title link, rating, number of ratings, and the one-line comment
I started with the book title and immediately fell into a trap: browsers such as Chrome and Firefox normalize the HTML and insert a tbody element under each table tag, but that tbody is not in the raw HTML the server sends.
Comparing "View Page Source" with "Inspect" makes the difference obvious: tbody only appears in the inspector.
So the fix is simply to drop /tbody after table:
- Original XPath: //*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a
- Fixed XPath: //*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a
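The tbody pitfall is easy to reproduce offline: lxml parses the raw HTML (no tbody), while the inspector shows the browser-normalized DOM (with tbody). A minimal sketch, using a made-up one-row table:

```python
from lxml import etree

# A minimal table as it appears in the raw HTML -- no tbody element.
html = '<table><tr><td><a title="Book A">Book A</a></td></tr></table>'
tree = etree.HTML(html)

# The browser-copied path (with tbody) matches nothing,
# because lxml's tree has no tbody node.
print(tree.xpath('//table/tbody/tr/td/a/@title'))  # []

# Dropping tbody matches as expected.
print(tree.xpath('//table/tr/td/a/@title'))        # ['Book A']
```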
① Code for the first book's information:
import requests
from lxml import etree

url = 'https://book.douban.com/top250'
r = requests.get(url)
#print(r.status_code)
selector = etree.HTML(r.text)
# book_name = selector.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tbody/tr/td[2]/div[1]/a/text()')
book_name = selector.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/@title')
rating = selector.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[2]/text()')
rating_num = selector.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[2]/span[3]/text()')
comment = selector.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/p[2]/span/text()')
book_link = selector.xpath('//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a/@href')
print(book_name, rating, rating_num, comment, book_link)
To scrape multiple books, list the XPath for several of them and compare. The table index increments by one per book; taking the title as an example:
//*[@id="content"]/div/div[1]/div/table[1]/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[2]/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[3]/tr/td[2]/div[1]/a
//*[@id="content"]/div/div[1]/div/table[4]/tr/td[2]/div[1]/a
② So the XPath that matches every book title simply drops the index:
//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a
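Why dropping the [1] works: an XPath step without a positional predicate matches every sibling at that level, so one expression collects the field from all 25 tables on the page. A self-contained sketch with two hypothetical tables:

```python
from lxml import etree

html = '''
<div id="content">
  <table><tr><td><a title="Book A"></a></td></tr></table>
  <table><tr><td><a title="Book B"></a></td></tr></table>
</div>
'''
tree = etree.HTML(html)

# With the index, only the first table matches.
print(tree.xpath('//div[@id="content"]/table[1]/tr/td/a/@title'))  # ['Book A']

# Without it, the expression matches every table.
print(tree.xpath('//div[@id="content"]/table/tr/td/a/@title'))     # ['Book A', 'Book B']
```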
The cleaned-up code:
import requests
from lxml import etree

url = 'https://book.douban.com/top250'
r = requests.get(url)
#print(r.status_code)
selector = etree.HTML(r.text)
book_names = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/@title')
ratings = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[2]/span[2]/text()')
rating_nums = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[2]/span[3]/text()')
comments = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/p[2]/span/text()')
book_links = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/@href')
print(book_names, ratings, rating_nums, comments, book_links)
③ Also, the XPath expressions above share a common prefix, so in practice you usually extract that repeated prefix first as a list of nodes, then run relative queries against each node:
books = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]')
for book in books:
    book_name = book.xpath('./div[1]/a/@title')[0]
    rating = book.xpath('./div[2]/span[2]/text()')[0]
    rating_num = book.xpath('./div[2]/span[3]/text()')[0].strip('()\n ')  # strip leading/trailing "(", ")", "\n", " "
    comment = book.xpath('./p[2]/span/text()')[0]
    book_link = book.xpath('./div[1]/a/@href')[0]
    print(book_name, rating, rating_num, comment, book_link)
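Two details in the loop are worth isolating: a path starting with ./ is evaluated relative to each extracted node rather than the whole document, and str.strip with a character set removes any of those characters from both ends only. A small sketch (the HTML snippet and the sample string are hypothetical):

```python
from lxml import etree

html = '<table><tr><td><div><a title="Book A" href="https://example.com/1"></a></div></td></tr></table>'
cell = etree.HTML(html).xpath('//td')[0]

# './' makes the query relative to this td node.
print(cell.xpath('./div/a/@title'))  # ['Book A']

# strip('()\n ') removes any of ( ) newline space from both ends,
# leaving the inner text untouched.
print('(\n 1234人评价\n )'.strip('()\n '))  # 1234人评价
```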
④ To scrape multiple pages, look for a pattern in the page URLs:
https://book.douban.com/top250?start=0
https://book.douban.com/top250?start=25
https://book.douban.com/top250?start=50
https://book.douban.com/top250?start=75
The pattern is clear: the trailing number starts at 0 and increases by 25 per page, so we can build the list of URLs with a comprehension:
urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]
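A quick check of what the comprehension produces (first URL, last URL, and count):

```python
urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]

print(len(urls))   # 10 pages x 25 books = 250
print(urls[0])     # https://book.douban.com/top250?start=0
print(urls[-1])    # https://book.douban.com/top250?start=225
```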
⑤ The complete code. Some books have no one-line comment, in which case the XPath query returns an empty list and indexing [0] raises IndexError, hence the try statement:
import requests
from lxml import etree
import time

urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]
for url in urls:
    r = requests.get(url)
    selector = etree.HTML(r.text)

    books = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]')
    for book in books:
        book_name = book.xpath('./div[1]/a/@title')[0]
        rating = book.xpath('./div[2]/span[2]/text()')[0]
        rating_num = book.xpath('./div[2]/span[3]/text()')[0].strip('()\n ')  # strip leading/trailing "(", ")", "\n", " "
        try:
            comment = book.xpath('./p[2]/span/text()')[0]
        except IndexError:  # some books have no one-line comment
            comment = ""
        book_link = book.xpath('./div[1]/a/@href')[0]
        print(book_name, rating, rating_num, comment, book_link)

    time.sleep(3)  # pause between pages to avoid hammering the server
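The try/except pattern generalizes to any optional field: xpath() always returns a list, so indexing [0] on a missing element raises IndexError. A minimal sketch of the fallback, using two hypothetical cells, one with and one without a comment:

```python
from lxml import etree

# One cell with a comment span, one without (hypothetical markup).
with_comment = etree.HTML('<table><tr><td><p><span>classic</span></p></td></tr></table>').xpath('//td')[0]
without_comment = etree.HTML('<table><tr><td></td></tr></table>').xpath('//td')[0]

def get_comment(cell):
    # xpath() returns a list; [0] on an empty result raises IndexError.
    try:
        return cell.xpath('./p/span/text()')[0]
    except IndexError:
        return ""

print(get_comment(with_comment))     # classic
print(get_comment(without_comment))  # (empty string)
```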