Scraping the FishC Forum help-request board, Python Discussion, Programming Languages, FishC Forum

smog posted on 2021-5-17 00:19:24

Scraping the FishC Forum help-request board

Last edited by smog on 2021-5-17 00:32

import os
import re  # data extraction
import time

import requests  # HTTP requests

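class FishC:
    # URL of the FishC forum bounty board; it ends with page=<page number>
    url = 'https://fishc.com.cn/bestanswer.php?mod=huzhu&type=undo&page='
    # Request headers
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36 Edg/90.0.818.56'}
    # See the format analysis below for how the regex is built. It extracts, in
    # order: resolved status, link and title, answer count, and post time.
    # The HTML tags in this pattern are reconstructed from the row format below
    # (an assumption); adjust them to the page's actual markup.
    reg = (r'\[(.*?)\]'
           r'|<a href="(https://fishc\.com\.cn/thread-\d{6}-1-1\.html)" target="_blank">(.*?)</a>'
           r'|<td style="text-align:center;">\s*(\d+?)\s*</td>'
           r'|<td style="text-align:center;">\s*(2021-.*?)\s*</td>')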

    # Every question row has this format ([待解决] means "unresolved"):
    # <tr>
    # <td>
    # [待解决]
    # <a href="https://fishc.com.cn/thread-196132-1-1.html" target="_blank">列表:用for循环出的一组数如何加入到一个空的列表中</a>
    # </td>
    # <td style="text-align:center;">
    # 1
    # </td>
    # <td style="text-align:center;">
    # 2021-05-16 22:12
    # </td>
    # </tr>

    def getHtml(self, url):
        # Send the headers defined on the class along with the request
        res = requests.get(url, headers=self.headers)
        code = res.status_code
        print(code)
        if code == 200:
            return res.text
        else:
            return None

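    def run(self):
        if not os.path.exists('data.csv'):
            self.initial()
        for i in range(11):  # Scrape only 11 pages; no need for more
            url = self.url + str(i + 1)  # Build the full URL
            rawHtml = self.getHtml(url)
            if rawHtml is not None:
                res = re.findall(self.reg, rawHtml)
                # Each row produces four consecutive matches, so pack every group
                # of four into one tuple and collect the tuples in a list. Print
                # the result of re.findall(self.reg, rawHtml) first to see what
                # res holds before reading this line.
                res = [(res[i][0], res[i + 1][1], self.parse(res[i + 1][2]), res[i + 2][3], res[i + 3][4])
                       for i in range(len(res)) if i % 4 == 0]
                with open('data.csv', 'a', encoding='utf-8') as fp:
                    for j in res:
                        for k in range(len(j)):
                            fp.write(j[k])
                            if k != len(j) - 1:
                                fp.write(',')
                        fp.write('\n')
                print(len(res))  # Each page of the FishC forum holds 15 rows
                print(res)

            # Pause between requests to avoid putting too much load on the FishC
            # forum; not sure whether it would ban my IP, haha
            time.sleep(5)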

    def parse(self, text):  # Convert HTML entities back to characters (optional, really)
        text = re.sub('&lt;', '<', text)
        text = re.sub('&gt;', '>', text)
        return text

    def initial(self):  # Initialize the data file
        # Column headers: resolved status, link, title, answer count, post time
        dataHeads = ["是否解决", "链接", "标题", "回答数", "提问时间"]
        with open('data.csv', 'w', encoding='utf-8') as fp:
            for i in dataHeads:
                fp.write(i)
                if i != dataHeads[-1]:
                    fp.write(',')
                else:
                    fp.write('\n')

    def test(self):  # Written while debugging (I scraped one page first to work out the regex); kept for reference, safe to ignore

        if not os.path.exists('data.csv'):
            self.initial()

        # Scrape one page into raw.html before running this; I deleted the code
        # that saved the HTML to raw.html...
        with open('raw.html', 'r', encoding='utf-8') as f:
            txt = f.read()

        res = re.findall(self.reg, txt)
        res = [(res[i][0], res[i + 1][1], self.parse(res[i + 1][2]), res[i + 2][3], res[i + 3][4])
               for i in range(len(res)) if i % 4 == 0]
        # res = [{"是否解决": i[0], "链接": i[1], "标题": self.parse(i[2]), "回答数": i[3], "提问时间": i[4]} for i in res]
        with open('data.csv', 'a', encoding='utf-8') as fp:
            for i in res:
                for j in range(len(i)):
                    fp.write(i[j])
                    if j != len(i) - 1:
                        fp.write(',')
                fp.write('\n')
        print(len(res))
        print(res)

if __name__ == '__main__':
    fishc = FishC()
    fishc.run()
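
The script's comment suggests printing what re.findall returns before reading the list comprehension. The sketch below does that offline, on the single sample row quoted in the script's format analysis. The pattern here is an assumption built from that sample row (the live page's markup may differ), and rows_from_html is a hypothetical helper name. It also uses html.unescape from the standard library, which decodes all HTML entities instead of only &lt; and &gt;.

```python
import html
import re

# Sample row copied from the format analysis in the script above.
SAMPLE = '''
<tr>
<td>
[待解决]
<a href="https://fishc.com.cn/thread-196132-1-1.html" target="_blank">列表:用for循环出的一组数如何加入到一个空的列表中</a>
</td>
<td style="text-align:center;">
1
</td>
<td style="text-align:center;">
2021-05-16 22:12
</td>
</tr>
'''

# Assumed pattern: four alternatives, five capture groups in total.
PATTERN = re.compile(
    r'\[(.*?)\]'
    r'|<a href="(https://fishc\.com\.cn/thread-\d{6}-1-1\.html)" target="_blank">(.*?)</a>'
    r'|<td style="text-align:center;">\s*(\d+)\s*</td>'
    r'|<td style="text-align:center;">\s*(\d{4}-\d{2}-\d{2} \d{2}:\d{2})\s*</td>'
)


def rows_from_html(page):
    """Group every four consecutive matches into one (status, link, title, answers, time) tuple."""
    matches = PATTERN.findall(page)
    rows = []
    # Stepping by 4 and stopping 3 short skips an incomplete trailing group.
    for i in range(0, len(matches) - 3, 4):
        rows.append((
            matches[i][0],                     # status comes from the 1st match
            matches[i + 1][1],                 # link comes from the 2nd match
            html.unescape(matches[i + 1][2]),  # title, with entities decoded
            matches[i + 2][3],                 # answer count from the 3rd match
            matches[i + 3][4],                 # post time from the 4th match
        ))
    return rows


# With alternation, each match fills only the groups of the alternative that
# matched; the other groups come back as empty strings.
matches = PATTERN.findall(SAMPLE)
for m in matches:
    print(m)
print(rows_from_html(SAMPLE))
```

Each printed match is a 5-tuple with only one or two non-empty fields, which is why the script picks a different index from each of the four matches that make up a row.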

Output format for each request:
200
15
[('待解决', 'https://fishc.com.cn/thread-196112-1-1.html', '操作符问题', '2', '2021-05-16 16:58'), ('待解决', 'https://fishc.com.cn/thread-196109-1-1.html', '碰到bug', '3', '2021-05-16 15:29'), ('待解决', 'https://fishc.com.cn/thread-196107-1-1.html', '初学者求救', '3', '2021-05-16 14:56'), ('待解决', 'https://fishc.com.cn/thread-196105-1-1.html', 'python作业谢谢', '7', '2021-05-16 14:51'), ('待解决', 'https://fishc.com.cn/thread-196104-1-1.html', '请教一下，怎么获取到验证码的请求地址呢', '0', '2021-05-16 14:48'), ('待解决', 'https://fishc.com.cn/thread-196102-1-1.html', '背景虚化问题折磨了两个小时了', '2', '2021-05-16 14:21'), ('待解决', 'https://fishc.com.cn/thread-196101-1-1.html', 'LEGB原则', '1', '2021-05-16 14:01'), ('待解决', 'https://fishc.com.cn/thread-196099-1-1.html', '求助：python中怎么删除索引带单引号的列呢？', '3', '2021-05-16 13:36'), ('待解决', 'https://fishc.com.cn/thread-196098-1-1.html', 'KMP算法求助，大佬们帮忙看看代码有什么错', '0', '2021-05-16 12:50'), ('待解决', 'https://fishc.com.cn/thread-196097-1-1.html', '函数文档', '5', '2021-05-16 12:22'), ('待解决', 'https://fishc.com.cn/thread-196095-1-1.html', '老版本第20课课后题string1找字符问题', '7', '2021-05-16 11:10'), ('待解决', 'https://fishc.com.cn/thread-196080-1-1.html', 'c++ 高精度乘单精度', '1', '2021-05-15 23:30'), ('待解决', 'https://fishc.com.cn/thread-196078-1-1.html', 'position的一些问题', '2', '2021-05-15 22:55'), ('待解决', 'https://fishc.com.cn/thread-196077-1-1.html', '求助！！！求助！！！C语言', '2', '2021-05-15 22:51'), ('待解决', 'https://fishc.com.cn/thread-196076-1-1.html', 'notepad', '2', '2021-05-15 22:37')]

Here is a link to a file I scraped:
File link
Data format:
是否解决,链接,标题,回答数,提问时间
待解决,https://fishc.com.cn/thread-196130-1-1.html,计算嵌套列表某一层次的元素数量<新人求助>（可以的话麻烦看看我的代码怎么改）,0,2021-05-16 21:41
待解决,https://fishc.com.cn/thread-196129-1-1.html,求助！代理服务器ip设计不成功,0,2021-05-16 21:36
待解决,https://fishc.com.cn/thread-196128-1-1.html,学不进去怎么办？,3,2021-05-16 21:16
待解决,https://fishc.com.cn/thread-196127-1-1.html,大佬们 linux怎么用制作windows的启动盘,4,2021-05-16 21:03
待解决,https://fishc.com.cn/thread-196125-1-1.html,有看不懂的报错，求救,2,2021-05-16 20:20
待解决,https://fishc.com.cn/thread-196123-1-1.html,python作业，刚学，求助,2,2021-05-16 20:03
...
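
One thing to watch with this format: the script joins fields with a bare comma, so a title that itself contains an ASCII comma would split into extra columns. A minimal sketch of an alternative, assuming the standard csv module is acceptable here, is shown below. The write_rows helper and the sample row are hypothetical and only serve the illustration.

```python
import csv
import io

# Same column headers as the script's initial() method.
HEADER = ["是否解决", "链接", "标题", "回答数", "提问时间"]


def write_rows(fp, rows, write_header=False):
    """Write rows with csv.writer, which quotes any field containing a comma."""
    writer = csv.writer(fp)
    if write_header:
        writer.writerow(HEADER)
    writer.writerows(rows)


# Hypothetical row (not scraped data) whose title contains a comma.
rows = [("待解决", "https://fishc.com.cn/thread-000000-1-1.html",
         "hypothetical title, with a comma", "0", "2021-05-16 00:00")]

# An in-memory buffer stands in for data.csv so the sketch runs anywhere.
buffer = io.StringIO(newline='')
write_rows(buffer, rows, write_header=True)
print(buffer.getvalue())

# Reading the text back yields the original five fields per row.
parsed = list(csv.reader(io.StringIO(buffer.getvalue(), newline='')))
print(parsed)
```

When writing to a real file, the csv documentation recommends opening it with newline='' (for example open('data.csv', 'a', encoding='utf-8', newline='')) so that line endings are not translated twice.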

南归 posted on 2021-5-17 09:23:58

I remember reading somewhere that scraping this forum is not allowed.

smog posted on 2021-5-18 00:12:07

南归 posted on 2021-5-17 09:23
I remember reading somewhere that scraping this forum is not allowed.

Oh, I see...
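
On whether scraping is allowed: one common place a site states this is its robots.txt file, and Python's standard library can evaluate such rules. The sketch below runs offline against a made-up rule set. EXAMPLE_ROBOTS is hypothetical and does not reflect the forum's real rules, and is_allowed and "my-crawler" are illustrative names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /bestanswer.php
"""

BOARD_URL = "https://fishc.com.cn/bestanswer.php?mod=huzhu&type=undo&page=1"
THREAD_URL = "https://fishc.com.cn/thread-196132-1-1.html"


def is_allowed(robots_text, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(EXAMPLE_ROBOTS, "my-crawler", BOARD_URL))
print(is_allowed(EXAMPLE_ROBOTS, "my-crawler", THREAD_URL))
```

To check a live site instead, RobotFileParser.set_url() followed by read() downloads the real file. A forum's terms of use may add restrictions that robots.txt does not express, so it is worth reading those as well.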
