Scraping Bilibili comments: the code below only fetches top-level comments, please advise how to fetch the replies as well (Python Discussion board, FishC Forum)

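不会改名的吧 posted on 2023-11-10 20:19:44

I'm scraping Bilibili comments, but the code below only fetches top-level comments. Could someone show me how to fetch the replies to them as well?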

import requests
import re
import time
import csv

# Request headers
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
}

# Comment API endpoint
original_url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3'

# Convert a Unix timestamp to a date string
def get_time(ctime):
    timeArray = time.localtime(ctime)
    otherStyleTime = time.strftime("%Y.%m.%d", timeArray)
    return str(otherStyleTime)

# Get the video's aid (passed to the comment API as oid)
def get_oid(bvid):
    video_url = 'https://www.bilibili.com/video/' + bvid
    page = requests.get(video_url, headers=header).text
    # The digits that follow "aid": in the page source are the aid
    aid = re.search(r'"aid":(\d+)', page).group(1)
    return aid

# Scrape comments and write them to the file as we go
def online_save(bvid):
    all_count = 0
    oid = get_oid(bvid)
    page = 1
    url = original_url.format(page, oid)
    html = requests.get(url, headers=header)
    data = html.json()
    count = int(data['data']['cursor']['all_count'])
    fname = bvid + '_comments.csv'
    with open(fname, 'w+', newline='', encoding='utf_8_sig') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(["Date", "Likes", "Username", "Comment"])  # Added "Username" header
        for i in data['data']['replies'] or []:
            message = i['content']['message']
            message = re.sub(r'\s+', '', message)
            ctime = get_time(i['ctime'])
            like = i['like']
            username = i['member']['uname']  # Added to get username
            csv_writer.writerow([ctime, str(like), username, message])  # Added username
            all_count = all_count + 1

            # Check for and collect reply comments (the field may be missing or null)
            for reply in i.get('replies') or []:
                reply_message = reply['content']['message']
                reply_message = re.sub(r'\s+', '', reply_message)
                reply_like = reply['like']
                reply_username = reply['member']['uname']
                csv_writer.writerow(["REPLY", str(reply_like), reply_username, reply_message])

        print('Total comments: {}, fetched so far: {}, page {} done.'.format(count, all_count, page))
        time.sleep(5)
        while all_count < count:
            page += 1
            url = original_url.format(page, oid)
            try:
                html = requests.get(url, headers=header)
                data = html.json()
                replies = data['data']['replies']
                if not replies:
                    # An empty page means there is nothing left to fetch
                    break
                for i in replies:
                    message = i['content']['message']
                    message = re.sub(r'\s+', '', message)
                    ctime = get_time(i['ctime'])
                    like = i['like']
                    username = i['member']['uname']
                    csv_writer.writerow([ctime, str(like), username, message])
                    all_count = all_count + 1

                    # Check for and collect reply comments (the field may be missing or null)
                    for reply in i.get('replies') or []:
                        reply_message = reply['content']['message']
                        reply_message = re.sub(r'\s+', '', reply_message)
                        reply_like = reply['like']
                        reply_username = reply['member']['uname']
                        csv_writer.writerow(["REPLY", str(reply_like), reply_username, reply_message])

                print('Total comments: {}, fetched so far: {}, page {} done.'.format(count, all_count, page))
                time.sleep(5)
            except (KeyError, TypeError, ValueError, requests.RequestException):
                # Stop on a network error or an unexpected response shape
                break

if __name__ == '__main__':
    bvid = input('Enter the video Bvid:')
    online_save(bvid)
    print('Done!')
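
The aid lookup in get_oid can be tried offline before hitting the site. The sketch below applies the same kind of regular expression to a made-up HTML fragment; the sample string and the name extract_aid are illustrative assumptions, and the real page layout may differ or change.

```python
import re

def extract_aid(page_source):
    """Return the digits that follow "aid": in the page source, or None."""
    match = re.search(r'"aid":(\d+)', page_source)
    # re.search returns None when nothing matches, so guard before .group()
    return match.group(1) if match else None

# Hypothetical page fragment, for illustration only
sample = '<script>window.__INITIAL_STATE__={"aid":123456,"bvid":"BV1xx"}</script>'
print(extract_aid(sample))        # 123456
print(extract_aid('no id here'))  # None
```

Returning None instead of calling .group() on a failed match lets the caller print a clear message when the page layout is not what the pattern expects.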

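isdkz posted on 2023-11-10 20:20:08

The problem with this code is that it only scrapes top-level comments and does not pick up the replies to them. To fix this, after fetching each top-level comment, check whether it has replies and, if so, collect their content as well. You can modify the code as follows: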

# Scrape comments and write them to the file as we go
def online_save(bvid):
    all_count = 0
    oid = get_oid(bvid)
    page = 1
    url = original_url.format(page, oid)
    fname = bvid + '_comments.csv'
    with open(fname, 'w+', newline='', encoding='utf_8_sig') as f:
        csv_writer = csv.writer(f)
        # Header row, including the columns for reply data
        csv_writer.writerow(["Date", "Likes", "Username", "Comment", "Reply type", "Reply likes", "Reply username", "Reply content"])
        while True:
            html = requests.get(url, headers=header)
            data = html.json()
            count = int(data['data']['cursor']['all_count'])
            for i in data['data']['replies']:
                message = i['content']['message']
                message = re.sub(r'\s+', '', message)
                ctime = get_time(i['ctime'])
                like = i['like']
                username = i['member']['uname']
                # Top-level comment: leave the reply columns empty
                csv_writer.writerow([ctime, str(like), username, message, "", "", "", ""])
                all_count += 1

                # Check for and collect reply comments (the field may be missing or null)
                for reply in i.get('replies') or []:
                    reply_message = reply['content']['message']
                    reply_message = re.sub(r'\s+', '', reply_message)
                    reply_like = reply['like']
                    reply_username = reply['member']['uname']
                    # Reply row: leave the top-level columns empty
                    csv_writer.writerow(["", "", "", "", "Reply", str(reply_like), reply_username, reply_message])
                    all_count += 1

            print('Total comments: {}, fetched so far: {}, page {} done.'.format(count, all_count, page))
            if all_count >= count:
                break
            page += 1
            url = original_url.format(page, oid)
            time.sleep(5)
print('Done!')

With this change, the code checks each top-level comment for replies while scraping and, when there are any, saves them to the CSV file as well.
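
One caveat: the replies list embedded in each top-level comment appears to be only a short preview, not the full thread. A hedged sketch of paging through one complete thread follows. The endpoint https://api.bilibili.com/x/v2/reply/reply, its root, pn and ps parameters, and the rpid field are assumptions taken from community documentation rather than anything confirmed in this thread; fetch_thread and fetch_json are hypothetical names.

```python
# Assumed endpoint for the replies under a single top-level comment
REPLY_URL = 'https://api.bilibili.com/x/v2/reply/reply'

def fetch_thread(fetch_json, oid, root_rpid, page_size=20, max_pages=500):
    """Return every reply under one top-level comment.

    fetch_json(url, params) must return the decoded JSON dict; it is passed
    in so the paging logic can be exercised without network access.
    """
    collected = []
    for pn in range(1, max_pages + 1):
        params = {'type': 1, 'oid': oid, 'root': root_rpid, 'pn': pn, 'ps': page_size}
        data = fetch_json(REPLY_URL, params)
        # 'data' or 'replies' may be missing or null on the last page
        replies = (data.get('data') or {}).get('replies') or []
        if not replies:
            break
        collected.extend(replies)
    return collected
```

With requests, fetch_json could be `lambda url, params: requests.get(url, params=params, headers=header).json()`, and root_rpid would be the (assumed) `i['rpid']` of each top-level comment. A pause between calls, like the `time.sleep(5)` already used above, is advisable.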

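不会改名的吧 posted on 2023-11-10 20:33:07

isdkz posted on 2023-11-10 20:20
The problem with this code is that it only scrapes top-level comments and does not pick up the replies to them. To fix this, after fetching each top-level ...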

import requests
import re
import time
import csv

# Request headers
header = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36',
}

# Comment API endpoint
original_url = 'https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next={}&type=1&oid={}&mode=3'

# Convert a Unix timestamp to a date string
def get_time(ctime):
    timeArray = time.localtime(ctime)
    otherStyleTime = time.strftime("%Y.%m.%d", timeArray)
    return str(otherStyleTime)

# Get the video's aid (passed to the comment API as oid)
def get_oid(bvid):
    video_url = 'https://www.bilibili.com/video/' + bvid
    page = requests.get(video_url, headers=header).text
    # The digits that follow "aid": in the page source are the aid
    aid = re.search(r'"aid":(\d+)', page).group(1)
    return aid

# Scrape comments and write them to the file as we go
def online_save(bvid):
    all_count = 0
    oid = get_oid(bvid)
    page = 1
    url = original_url.format(page, oid)
    fname = bvid + '_comments.csv'
    with open(fname, 'w+', newline='', encoding='utf_8_sig') as f:
        csv_writer = csv.writer(f)
        # Header row, including the columns for reply data
        csv_writer.writerow(["Date", "Likes", "Username", "Comment", "Reply type", "Reply likes", "Reply username", "Reply content"])
        while True:
            html = requests.get(url, headers=header)
            data = html.json()
            count = int(data['data']['cursor']['all_count'])
            for i in data['data']['replies']:
                message = i['content']['message']
                message = re.sub(r'\s+', '', message)
                ctime = get_time(i['ctime'])
                like = i['like']
                username = i['member']['uname']
                # Top-level comment: leave the reply columns empty
                csv_writer.writerow([ctime, str(like), username, message, "", "", "", ""])
                all_count += 1

                # Check for and collect reply comments (the field may be missing or null)
                for reply in i.get('replies') or []:
                    reply_message = reply['content']['message']
                    reply_message = re.sub(r'\s+', '', reply_message)
                    reply_like = reply['like']
                    reply_username = reply['member']['uname']
                    # Reply row: leave the top-level columns empty
                    csv_writer.writerow(["", "", "", "", "Reply", str(reply_like), reply_username, reply_message])
                    all_count += 1

            print('Total comments: {}, fetched so far: {}, page {} done.'.format(count, all_count, page))
            if all_count >= count:
                break
            page += 1
            url = original_url.format(page, oid)
            time.sleep(5)

if __name__ == '__main__':
    bvid = input('Enter the video Bvid:')
    online_save(bvid)
    print('Done!')

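不会改名的吧 posted on 2023-11-10 20:34:06

isdkz posted on 2023-11-10 20:20
The problem with this code is that it only scrapes top-level comments and does not pick up the replies to them. To fix this, after fetching each top-level ...

With this version, I found that the code never stops: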
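Enter the video Bvid:BV1Hw411q7PV
Total comments: 170, fetched so far: 38, page 1 done.
Total comments: 170, fetched so far: 61, page 2 done.
Total comments: 170, fetched so far: 83, page 3 done.
Total comments: 170, fetched so far: 103, page 4 done.
Total comments: 170, fetched so far: 123, page 5 done.
Total comments: 170, fetched so far: 141, page 6 done.
Total comments: 170, fetched so far: 141, page 7 done.
Total comments: 170, fetched so far: 141, page 8 done.
Total comments: 170, fetched so far: 141, page 9 done.
Total comments: 170, fetched so far: 141, page 10 done.
Total comments: 170, fetched so far: 141, page 11 done.
Total comments: 170, fetched so far: 141, page 12 done.
Total comments: 170, fetched so far: 141, page 13 done.
Total comments: 170, fetched so far: 141, page 14 done.
Total comments: 170, fetched so far: 141, page 15 done.
Total comments: 170, fetched so far: 141, page 16 done.
Total comments: 170, fetched so far: 141, page 17 done.
Total comments: 170, fetched so far: 141, page 18 done.
Total comments: 170, fetched so far: 141, page 19 done.
Total comments: 170, fetched so far: 141, page 20 done.
Total comments: 170, fetched so far: 141, page 21 done.
Total comments: 170, fetched so far: 141, page 22 done.
Total comments: 170, fetched so far: 141, page 23 done.
Total comments: 170, fetched so far: 141, page 24 done.
Total comments: 170, fetched so far: 141, page 25 done.
Total comments: 170, fetched so far: 141, page 26 done.
Total comments: 170, fetched so far: 141, page 27 done.
Total comments: 170, fetched so far: 141, page 28 done.
Total comments: 170, fetched so far: 141, page 29 done.
Total comments: 170, fetched so far: 141, page 30 done.
Total comments: 170, fetched so far: 141, page 31 done.
Total comments: 170, fetched so far: 141, page 32 done.
Total comments: 170, fetched so far: 141, page 33 done.
Total comments: 170, fetched so far: 141, page 34 done.
Total comments: 170, fetched so far: 141, page 35 done.
Total comments: 170, fetched so far: 141, page 36 done.
Total comments: 170, fetched so far: 141, page 37 done.
Total comments: 170, fetched so far: 141, page 38 done.
Total comments: 170, fetched so far: 141, page 39 done.
Total comments: 170, fetched so far: 141, page 40 done.
Total comments: 170, fetched so far: 141, page 41 done.
Total comments: 170, fetched so far: 141, page 42 done.
Total comments: 170, fetched so far: 141, page 43 done.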
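
The log hints at why the loop never ends: the API's all_count (170 here) presumably includes every reply in every thread, while each page embeds only some of the replies, so the running total stalls at 141 and the all_count >= count check is never satisfied. A minimal sketch of a safer stop condition follows, ending as soon as a page comes back with no top-level comments. The names collect_rows and fetch_page are hypothetical; fetch_page(page) is assumed to return the same JSON dict that the request in the code above produces.

```python
def collect_rows(fetch_page, max_pages=1000):
    """Gather (kind, username, message) rows until a page is empty.

    max_pages is a hard cap so the loop ends even if the API misbehaves.
    """
    rows = []
    for page in range(1, max_pages + 1):
        data = fetch_page(page)
        # 'data' or 'replies' may be missing or null once the pages run out
        top_level = (data.get('data') or {}).get('replies') or []
        if not top_level:
            break  # nothing left: stop instead of comparing against all_count
        for comment in top_level:
            rows.append(('comment', comment['member']['uname'], comment['content']['message']))
            for reply in comment.get('replies') or []:
                rows.append(('reply', reply['member']['uname'], reply['content']['message']))
    return rows
```

With requests, fetch_page could be `lambda p: requests.get(original_url.format(p, oid), headers=header).json()`, with the existing `time.sleep(5)` kept between calls.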

梅开朗基罗 posted on 2023-12-17 21:51:33

How should I store the scraped data in a MongoDB database? @FishC

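FishC posted on 2023-12-17 21:51:51

梅开朗基罗 posted on 2023-12-17 21:51
How should I store the scraped data in a MongoDB database? @Fi ...

To store scraped data in MongoDB, you first need to install MongoDB and its Python driver (pymongo). Below is a simple example showing how to store data in a MongoDB database:

First, make sure MongoDB and pymongo are installed. You can install pymongo with the following command:

pip install pymongo

Next, import pymongo and connect to the MongoDB database:

from pymongo import MongoClient

# Connect to the MongoDB server
client = MongoClient('mongodb://localhost:27017/')

# Select a database (created on first write if it does not exist)
db = client['mydatabase']

# Select a collection (similar to a table in a relational database)
collection = db['mycollection']

Now you can convert the scraped data into a dictionary and insert it into the collection with the `insert_one()` method:

# Suppose you have one scraped record
data = {'name': 'Alice', 'age': 25, 'city': 'Shanghai'}

# Insert the record into the collection
collection.insert_one(data)

That stores one record in MongoDB. If you have multiple records to insert, you can use the `insert_many()` method.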
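
Applied to the comment scraper in this thread, a sketch of the `insert_many()` route might look like the following. The field names and the helpers rows_to_documents and save_comments are illustrative assumptions; collection is expected to be a pymongo collection such as the one created above.

```python
def rows_to_documents(rows):
    """Turn (date, likes, username, message) rows into MongoDB-ready dicts."""
    return [
        {'date': date, 'likes': int(likes), 'username': username, 'message': message}
        for date, likes, username, message in rows
    ]

def save_comments(collection, rows):
    """Insert all rows with a single insert_many call; return how many were sent."""
    docs = rows_to_documents(rows)
    if docs:  # pymongo rejects an empty list, so skip the call in that case
        collection.insert_many(docs)
    return len(docs)
```

Collecting the rows for one page and sending them in one `insert_many()` call avoids a round trip to the database for every comment.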

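To read the data back, you can query the collection with the `find()` method. For example:

# Query all documents in the collection
results = collection.find()

# Iterate over the result set and print each document
for result in results:
    print(result)

I hope this example helps! Remember to install MongoDB and the pymongo library before using it.

The reply above is from the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up.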
