FishC Forum


[Tech Exchange] Web Scraping in Practice: Two Ways to Save Scraped Data as CSV

Posted on 2018-2-3 21:36:25

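Using the Douban Books Top 250 list as the example:
URL to scrape: https://book.douban.com/top250
Fields to scrape: book title, book link, rating, number of ratings, one-line comment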

Method 1:
    with open("F:/book_top250.csv", "w") as f:
        # Build each row by hand; fields are joined with commas and not quoted
        f.write("{},{},{},{},{}\n".format(book_name, rating, rating_num, comment, book_link))
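One caveat with the first method (plain f.write with str.format): nothing quotes the fields, so a value that itself contains a comma is split into extra columns when the file is read back. A minimal sketch, using made-up sample values rather than real scraped data and an in-memory buffer in place of the file:

```python
import csv
import io

# Hypothetical sample row; the comment field contains a comma.
row = ["Sample Book", "9.0", "1000", "short, but moving", "https://example.com/book/1"]

# First-method style: join the fields by hand with str.format.
line = "{},{},{},{},{}\n".format(*row)

# Reading the line back yields six columns instead of five.
parsed = next(csv.reader(io.StringIO(line)))
print(len(parsed))  # 6
```

If the scraped text can contain ASCII commas or quotes, the csv module is the safer choice because it quotes such fields automatically.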


Method 2:
    import csv

    # Without newline="", a blank line appears between rows (on Windows)
    with open("F:/book_top250.csv", "w", newline="") as f:
        w = csv.writer(f)
        w.writerow([book_name, rating, rating_num, comment, book_link])
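Why newline="" matters: csv.writer ends every row with "\r\n" by default. If the file is opened in text mode without newline="", Windows translates the "\n" once more into "\r\n", so each row ends in "\r\r\n" and shows up with a blank line after it. The sketch below uses io.StringIO as a stand-in for the file (an assumption for illustration, with made-up values) to show the terminator that csv.writer emits and how it quotes a field containing a comma:

```python
import csv
import io

# In-memory stand-in for open(..., "w", newline=""); no newline translation.
buf = io.StringIO(newline="")
w = csv.writer(buf)
w.writerow(["Sample Book", "9.0", "short, but moving"])  # hypothetical values

output = buf.getvalue()
print(repr(output))  # 'Sample Book,9.0,"short, but moving"\r\n'
```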


Full code for Method 1:
    import time

    import requests
    from lxml import etree

    # Assumption: Douban may reject requests that lack a browser-like User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}

    # 10 pages of 25 books each: start=0, 25, ..., 225
    urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]
    with open("F:/book_top250.csv", "w") as f:
        for url in urls:
            r = requests.get(url, headers=headers)
            selector = etree.HTML(r.text)

            # Each td[2] cell holds the details of one book
            books = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]')
            for book in books:
                book_name = book.xpath('./div[1]/a/@title')[0]
                rating = book.xpath('./div[2]/span[2]/text()')[0]
                # Strip leading/trailing "(", ")", newline, and space characters
                rating_num = book.xpath('./div[2]/span[3]/text()')[0].strip('()\n ')
                try:
                    comment = book.xpath('./p[2]/span/text()')[0]
                except IndexError:  # some books have no one-line comment
                    comment = ""
                book_link = book.xpath('./div[1]/a/@href')[0]
                f.write("{},{},{},{},{}\n".format(book_name, rating, rating_num, comment, book_link))

            time.sleep(1)  # pause between pages
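Two details in the inner loop are easy to misread. str.strip with an argument treats it as a set of characters to remove from both ends, not as a literal substring. And indexing [0] on an empty XPath result raises IndexError, which is why the comment lookup needs a fallback. A minimal sketch with a made-up text value; first_or_default is a hypothetical helper, not part of the original post or of lxml:

```python
# Hypothetical text node, shaped like the rating-count span on the page.
raw = "(\n                    12345人评价\n                )"

# Removes any of "(", ")", newline, and space from both ends.
rating_num = raw.strip("()\n ")
print(rating_num)  # 12345人评价


def first_or_default(results, default=""):
    """Return the first item of an XPath result list, or default if it is empty."""
    return results[0] if results else default


print(repr(first_or_default([])))            # ''
print(first_or_default(["a great read"]))    # a great read
```

A helper like this could replace the try/except around the comment lookup; catching IndexError specifically works just as well.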


Full code for Method 2:
    import csv
    import time

    import requests
    from lxml import etree

    # Assumption: Douban may reject requests that lack a browser-like User-Agent
    headers = {"User-Agent": "Mozilla/5.0"}

    # 10 pages of 25 books each: start=0, 25, ..., 225
    urls = ['https://book.douban.com/top250?start={}'.format(i * 25) for i in range(10)]
    with open("F:/book_top250.csv", "w", newline='') as f:
        w = csv.writer(f)  # create the writer once, not once per book
        for url in urls:
            r = requests.get(url, headers=headers)
            selector = etree.HTML(r.text)

            # Each td[2] cell holds the details of one book
            books = selector.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]')
            for book in books:
                book_name = book.xpath('./div[1]/a/@title')[0]
                rating = book.xpath('./div[2]/span[2]/text()')[0]
                # Strip leading/trailing "(", ")", newline, and space characters
                rating_num = book.xpath('./div[2]/span[3]/text()')[0].strip('()\n ')
                try:
                    comment = book.xpath('./p[2]/span/text()')[0]
                except IndexError:  # some books have no one-line comment
                    comment = ""
                book_link = book.xpath('./div[1]/a/@href')[0]

                w.writerow([book_name, rating, rating_num, comment, book_link])
            time.sleep(1)  # pause between pages
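Neither version writes a header row. If one is wanted, csv.DictWriter is one option. The sketch below is only an illustration under stated assumptions: the column names and sample rows are made up, the file goes to a temporary directory instead of F:/book_top250.csv, and encoding="utf-8-sig" is my choice (Excel on Windows commonly relies on the BOM to detect UTF-8), not something the original post uses.

```python
import csv
import os
import tempfile

# Hypothetical column names matching the five scraped fields.
FIELDS = ["book_name", "rating", "rating_num", "comment", "book_link"]

# Made-up sample rows standing in for scraped data.
rows = [
    {"book_name": "Sample Book A", "rating": "9.0", "rating_num": "1000",
     "comment": "short, but moving", "book_link": "https://example.com/book/1"},
    {"book_name": "Sample Book B", "rating": "8.5", "rating_num": "500",
     "comment": "", "book_link": "https://example.com/book/2"},
]

path = os.path.join(tempfile.mkdtemp(), "book_top250_demo.csv")

# Write the header once, then all rows.
with open(path, "w", newline="", encoding="utf-8-sig") as f:
    w = csv.DictWriter(f, fieldnames=FIELDS)
    w.writeheader()
    w.writerows(rows)

# Read the file back to check that the rows survive the round trip.
with open(path, newline="", encoding="utf-8-sig") as f:
    loaded = list(csv.DictReader(f))

print(len(loaded), loaded[0]["comment"])
```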


