A classmate needs to scrape the text content and timestamps of Weibo posts, and gave me an Excel file containing the URLs to crawl:
import pandas as pd
import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver

# Run Chrome headless so no browser window opens
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(options=chrome_options)

start_time = time.time()

def crawer(url):
    """Load the page and return its rendered HTML source."""
    print('Crawling ' + url)
    browser.get(url)
    time.sleep(1)  # crude wait for the page's JavaScript to finish rendering
    return browser.page_source

def execution_data(res):
    """Extract the post text and the created_at timestamps from the HTML."""
    bs4_res = BeautifulSoup(res, 'html.parser')
    text = bs4_res.select('#app > div.lite-page-wrap > div > div.main > div > article > div > div > div.weibo-text')[0].text
    # The timestamp sits in JSON embedded in the page source
    time_ = re.findall('"created_at": "(.*?)"', res, re.S)
    return text, time_

data = pd.read_excel('wangzhi.xlsx')
text_all = []
time_all = []
for i in range(data.shape[0]):  # the original range(data.shape[0]-1) skipped the last row
    try:
        url = data['网址'].iloc[i]  # '网址' is the URL column of the Excel file
        res = crawer(url)
        text, time_ = execution_data(res)  # parse once instead of twice
        text_all.append(text)
        time_all.append(time_)
    except Exception:
        print('Problem with page ' + str(i + 1))
        print(url)
        text_all.append([])
        time_all.append([])

context = pd.DataFrame({'文本': text_all, '时间': time_all})
context.to_excel('微博正文.xlsx')
print('Crawling finished')
browser.quit()  # release the browser once all pages are done

end_time = time.time()
total_time = end_time - start_time
print('All tasks done, total time: ' + str(total_time))
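The regex also leaves created_at as a raw string. If it follows the 'Fri Sep 08 17:33:00 +0800 2023' layout that Weibo's embedded JSON typically uses (an assumption; print one value first to confirm), datetime.strptime can turn it into a proper timestamp:

from datetime import datetime

def parse_created_at(raw):
    # Format string matches e.g. 'Fri Sep 08 17:33:00 +0800 2023';
    # adjust it if the actual page uses a different layout
    return datetime.strptime(raw, '%a %b %d %H:%M:%S %z %Y')

# parse_created_at('Fri Sep 08 17:33:00 +0800 2023')
# -> 2023-09-08 17:33:00+08:00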
[screenshot: the original URL data in the Excel file]

[screenshot: sample of the scraped results]