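Using the Douban Books Top 250 list as an example:
URL to scrape: https://book.douban.com/top250
Fields to scrape: book title, book link, rating, number of ratings, one-line comment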
Method 1: write each row as a formatted string
- with open("F:/book_top250.csv", "w") as f:
-     f.write("{},{},{},{},{}\n".format(book_name, rating, rating_num, comment, book_link))
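One thing to watch with method 1: the fields are joined with commas and nothing is escaped, so a title or comment that itself contains a comma shifts the columns. The sketch below compares the two styles on a made-up row (the sample values and the example.com link are hypothetical, not taken from Douban).

```python
import csv
import io

# Hypothetical sample row: the title and the comment each contain a comma.
row = ["Title, Part 1", "9.0", "1000", "Classic, still worth reading",
       "https://example.com/book/1"]

# Method 1 style: plain string formatting, no escaping.
naive_line = "{},{},{},{},{}".format(*row)

# Method 2 style: csv.writer quotes any field that contains a comma.
buf = io.StringIO()
csv.writer(buf).writerow(row)
quoted_line = buf.getvalue()

# Parse both lines back to see how many columns a CSV reader finds.
naive_fields = next(csv.reader([naive_line]))
quoted_fields = next(csv.reader([quoted_line]))

print(len(naive_fields))     # 7: the two embedded commas split two fields
print(quoted_fields == row)  # True: all five fields survive intact
```

So method 1 is fine for a quick dump, but method 2 is the safer choice when the text fields are free-form.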
Method 2: use the csv module
- import csv
-
- # Without newline="", a blank line appears between rows (seen on Windows)
- with open("F:/book_top250.csv", "w", newline="") as f:
-     w = csv.writer(f)
-     w.writerow([book_name, rating, rating_num, comment, book_link])
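As for why newline="" matters: csv.writer ends every row with "\r\n" by itself. If the file is opened in normal text mode on Windows, each "\n" is then expanded to "\r\n" again, which spreadsheet programs tend to show as an empty row. A small sketch using an in-memory buffer (the replace() call only simulates the Windows text-mode translation, it is not what open() literally runs):

```python
import csv
import io

# io.StringIO does no newline translation by default, so we see exactly
# what csv.writer emits.
buf = io.StringIO()
csv.writer(buf).writerow(["a", "b"])
written = buf.getvalue()
print(repr(written))  # 'a,b\r\n'

# Simulation of Windows text mode without newline="": every "\n" becomes "\r\n".
translated = written.replace("\n", "\r\n")
print(repr(translated))  # 'a,b\r\r\n' -> the extra "\r" shows up as a blank row
```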
Full code for method 1:
- import requests
- from lxml import etree
- import time
-
- # Ten result pages, 25 books per page
- urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]
- with open("F:/book_top250.csv", "w") as f:
-     for url in urls:
-         r = requests.get(url)
-         selector = etree.HTML(r.text)
-
-         # Each matched <td> holds the details of one book
-         books = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]')
-         for book in books:
-             book_name = book.xpath('./div[1]/a/@title')[0]
-             rating = book.xpath('./div[2]/span[2]/text()')[0]
-             # Strip leading/trailing "(", ")", newlines, and spaces
-             rating_num = book.xpath('./div[2]/span[3]/text()')[0].strip('()\n ')
-             try:
-                 comment = book.xpath('./p[2]/span/text()')[0]
-             except IndexError:  # some books have no one-line comment
-                 comment = ""
-             book_link = book.xpath('./div[1]/a/@href')[0]
-             f.write("{},{},{},{},{}\n".format(book_name, rating, rating_num, comment, book_link))
-         time.sleep(1)  # pause between page requests
Full code for method 2:
- import requests
- from lxml import etree
- import time
- import csv
-
- # Ten result pages, 25 books per page
- urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]
- with open("F:/book_top250.csv", "w", newline='') as f:
-     w = csv.writer(f)  # create the writer once and reuse it for every row
-     for url in urls:
-         r = requests.get(url)
-         selector = etree.HTML(r.text)
-
-         # Each matched <td> holds the details of one book
-         books = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]')
-         for book in books:
-             book_name = book.xpath('./div[1]/a/@title')[0]
-             rating = book.xpath('./div[2]/span[2]/text()')[0]
-             # Strip leading/trailing "(", ")", newlines, and spaces
-             rating_num = book.xpath('./div[2]/span[3]/text()')[0].strip('()\n ')
-             try:
-                 comment = book.xpath('./p[2]/span/text()')[0]
-             except IndexError:  # some books have no one-line comment
-                 comment = ""
-             book_link = book.xpath('./div[1]/a/@href')[0]
-             w.writerow([book_name, rating, rating_num, comment, book_link])
-         time.sleep(1)  # pause between page requests
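To check that the file written with method 2 reads back cleanly, something like the following should work. It is only a sketch: the rows are made up, the header row and the temp-file path are my own additions, and encoding="utf-8-sig" is an assumption for the case where the file will be opened in Excel, which commonly relies on the BOM to detect UTF-8.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the scraped data.
header = ["book_name", "rating", "rating_num", "comment", "book_link"]
rows = [
    ["书名, 上册", "9.0", "1000", "一句话点评", "https://example.com/book/1"],
    ["Another Title", "8.5", "200", "", "https://example.com/book/2"],
]

# Write to a temporary file instead of F:/ so the sketch runs anywhere.
path = os.path.join(tempfile.mkdtemp(), "book_top250_sample.csv")
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)

# Read the file back; utf-8-sig drops the BOM again on reading.
with open(path, newline="", encoding="utf-8-sig") as f:
    read_back = list(csv.reader(f))

print(read_back == [header] + rows)  # True
```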