鱼C论坛

Views: 2629 | Replies: 5

[Solved] Help needed: Python crawler for scraping comments fails on some URLs

Posted by 3236654291 (OP) on 2021-7-7 09:23:23

Last edited by 3236654291 on 2021-7-7 09:42

For some threads I can't scrape the page count.
For example:
https://fishc.com.cn/thread-107659-1-1.html
Why is that?

import requests
import bs4
import re


def get_num(url):  # get the thread's page count
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'

    res = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    number = []
    target = soup.find_all("div", class_="pg")  # the pagination bar
    for each in target:
        number.append(each.label.span.text)
    nr = re.search(r"\d+", str(number))
    print(number)
    if nr is None:  # no pagination bar found: assume a single page
        return "1"
    else:
        return str(nr.group())


def main(url, c, running=True):  # scrape the comments on one page
    print(url)
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'

    res = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    targets = soup.find_all("td", class_="t_f")  # post bodies

    for i in targets:
        if running:
            running = False  # skip the first post body on the page (meant to drop the question itself)
        else:
            c.append(i.text)
    return c


def keep(file_name, what):  # save to file
    # main() always returns the same list c, so what[0] already holds
    # the comments collected from every page.
    content = ''.join('%s' % item for item in what[0])
    with open(file_name, "w", encoding='utf-8') as f:
        f.write(str(content))


def decide(url):  # check which URL form was given
    # thread-<tid>-<page>-1.html URLs match this pattern;
    # forum.php?mod=viewthread&tid=... URLs do not.
    hi = re.search(r"-\d{1,10}-\d{1,10}-\d{1,10}", url)
    return hi is None


def conversion(url, n):  # cut the page/suffix segment off a thread URL
    return url[:n]


if __name__ == "__main__":
    url = input("Thread URL to scrape (FishC URLs only; some pages may fail): ")
    num = get_num(url)  # total number of pages
    file_name = input("File name to save to (no extension; an existing file of the same name is overwritten): ")
    file_name = file_name + ".txt"

    c = []
    what = []
    if decide(url=url):
        # forum.php?...&tid=... form: append &page=N
        for i in range(int(num)):
            page = str(i + 1)
            what.append(main(url=url + '&page=' + page, c=c))

        keep(file_name=file_name, what=what)  # save
    else:
        # thread-<tid>-<page>-1.html form: rebuild the page segment
        for i in range(int(num)):
            page = str(i + 1)
            what.append(main(url=conversion(url=url, n=-8) + page + '-1.html', c=c))

        keep(file_name=file_name, what=what)


Posted by Twilight6 on 2021-7-7 10:58:57


In the get_num function that fetches the page count, just add a Cookie to the headers. For example (grab the Cookie from your own browser):

headers['Cookie'] = 'paste the Cookie your browser sends here'
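For reference, a minimal sketch of the same idea using requests' cookies parameter instead of one long header string. The cookie names below (PHPSESSID, oMVX_2132_auth) appear in the full Cookie string later in this thread; the values are placeholders you would copy from your own browser:

import requests

# Placeholder values: take the real names/values from the Cookie your
# browser sends to fishc.com.cn (developer tools -> Network tab).
cookies = {
    'PHPSESSID': 'your-session-id',
    'oMVX_2132_auth': 'your-auth-token',
}
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://fishc.com.cn/thread-107659-1-1.html',
                   headers=headers, cookies=cookies)

requests merges the dict into a Cookie header for you, which is easier to update than a raw string.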



Posted by 3236654291 (OP) on 2021-7-7 14:25:07
Twilight6 posted on 2021-7-7 10:58:
In the get_num function that fetches the page count, just add a Cookie to the headers. For example (grab the Cookie from your own browser):

...

I did try that. And I found that when I print res.text inside get_num(), the total page count I'm after isn't in there at all.
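A quick way to confirm this diagnosis, assuming the pagination bar is the div with class pg that get_num looks for (a sketch, not part of the original code):

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://fishc.com.cn/thread-107659-1-1.html', headers=headers)
# Without a logged-in Cookie the server may send back a reduced page
# that simply has no pagination bar, so find_all("div", class_="pg")
# returns an empty list and the regex never matches.
print('class="pg"' in res.text)  # likely False for the failing threads
print(len(res.text))             # compare with the page size seen in a browser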

Posted by Twilight6 on 2021-7-7 09:23:24 | Best answer
3236654291 posted on 2021-7-7 14:25:
I did try that. And when I print res.text inside get_num(), the total page count I'm after isn't in there.


Try this: replace your get_num function with the one below:

def get_num(url):  # get the page count
    headers = {}
    headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'
    headers['Cookie'] = 'oMVX_2132_saltkey=ttg5geGa; oMVX_2132_lastvisit=1624623545; oMVX_2132_auth=e37dd2PNTtpF1Tmt8%2ByP9Gr%2FAvWXweScDEHJ%2FLzFd%2BRFC41%2B81vqg6h784Xa9sG54pMFBr5hP6zGidNgcAEswvrcJ20; oMVX_2132_lastcheckfeed=854664%7C1624627154; oMVX_2132_atarget=1; oMVX_2132_smile=6D1; oMVX_2132_atlist=9; oMVX_2132_sid=ng222X; oMVX_2132_lip=112.50.189.204%2C1625758119; PHPSESSID=mgj7l3mkssq70ag6j7es6d99c6; oMVX_2132_ulastactivity=5838ck7yBFJTgL%2FfVWUcIRaHqSxszsrs5%2BXWC79MnRSl6WIDnqvg; acw_tc=781bad0816257949619514154e4035c90ea37235db295b2910fb132ac7402f; oMVX_2132_noticeTitle=1; oMVX_2132_st_t=854664%7C1625795252%7Cc6611f7934f6145aef7408b139f83a36; oMVX_2132_forum_lastvisit=D_173_1625795252; oMVX_2132_sendmail=1; oMVX_2132_visitedfid=354D173D188D38D125D149D33D171D219D84; oMVX_2132_viewid=tid_107659; oMVX_2132_checkpm=1; oMVX_2132_st_p=854664%7C1625795751%7C0c786b5b745132c2deaf8706e2812eb0; _fmdata=XqSSl%2FXqDIZ5Fsa93RNAO8xoew4KAZ%2FpoIVHewJ6UI5D5tAk0kpV1t58NSABPaVntgjj6GUetZ4oh6vFn%2BMgOCfhUZpBPvGfFCQuOJMGw3g%3D; oMVX_2132_lastact=1625795752%09misc.php%09patch'

    res = requests.get(url, headers=headers)
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    number = []
    target = soup.find_all("div", class_="pg")
    for each in target:
        number.append(each.label.span.text)
    nr = re.search(r"\d+", str(number))
    print(number)
    if nr is None:
        return "1"
    else:
        return str(nr.group())
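One caveat worth noting: a Cookie string copied out of the browser eventually expires, so the fix above stops working after a while and needs a fresh value. A possible refinement, sketched here rather than taken from the thread, is to use a single requests.Session so the headers are sent with every request and any cookies the server sets along the way are kept automatically:

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'paste a fresh Cookie from your browser here',
})
# Every request made through session now carries these headers,
# and cookies set by the server are retained for later requests.
res = session.get('https://fishc.com.cn/thread-107659-1-1.html')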

Posted by 3236654291 (OP) on 2021-7-9 11:22:46
Twilight6 posted on 2021-7-9 09:58:
Try this: replace your get_num function with the one below:


It works! This one's great. Thanks a lot!
