[Solved] [Beginner scraper question] Converting formats

hello? · Posted on 2022-8-6 23:33:55


Line 8 of this code has a problem:

import requests
from bs4 import BeautifulSoup
url="https://s.weibo.com/top/summary?cate=realtimehot"
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.77"}
response=requests.get(url,headers=headers)
content=response.content.decode('utf-8')
soup=BeautifulSoup(content,'lxml')

Error:

Traceback (most recent call last):
  File "D:/py/访问微博热搜.py", line 8, in <module>
    content=response.content.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 339: invalid continuation byte

After I replaced line 8, the error went away:

import requests
from bs4 import BeautifulSoup
url="https://s.weibo.com/top/summary/"
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.77"}
response=requests.get(url,headers=headers)
content=response.encoding='utf-8'
soup=BeautifulSoup(content,'lxml')
print(soup)

But this is what gets printed:

<html><body><p>utf-8</p></body></html>

What I want is the page source.

Could someone explain what lines 8 and 9 mean?
How should I handle this?
Thanks, everyone.
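
A minimal sketch of what the second attempt probably does: `content=response.encoding='utf-8'` is a chained assignment, so the string 'utf-8' is bound to both `response.encoding` and `content`. BeautifulSoup is then handed the literal text utf-8 rather than the page, which would explain the `<p>utf-8</p>` output. The `SimpleNamespace` object and the sample HTML below are hypothetical stand-ins for a real `requests` response, used only so the snippet runs without network access.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a requests.Response: a real one keeps the raw
# body as bytes in .content and the charset name in .encoding.
response = SimpleNamespace(
    content="<html>hot search</html>".encode("utf-8"),
    encoding=None,
)

# Chained assignment: the right-hand string is bound to BOTH targets.
content = response.encoding = "utf-8"
print(repr(content))  # the charset name, not the page source

# What was probably intended: decode the body bytes into text.
page = response.content.decode(response.encoding)
print(page)
```

With a real response, setting `response.encoding` on one line and reading `response.text` on the next would give the decoded page.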


hello? · Posted on 2022-8-6 23:34:34

Typo in the original title: I meant 格式 (format).

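liuzhengyuan · Posted on 2022-8-6 23:38:05

When you open this site, it seems you have to click the page manually with the mouse before you actually get in.

liuzhengyuan · Posted on 2022-8-6 23:40:17

Just add a cookie parameter: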

from requests import get

# Placeholder: paste the Cookie header value copied from your own browser session.
cookie = "<Cookie header value from your browser>"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.77",
    "cookie": cookie,
}
res = get("https://s.weibo.com/top/summary?cate=realtimehot", headers=headers)
print(res.text)

hello? · Posted on 2022-8-7 00:16:22

liuzhengyuan posted on 2022-8-6 23:40
Just add a cookie parameter

That doesn't seem right.

from requests import get
headers={"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.77"}
res = get('https://s.weibo.com/top/summary/', headers = headers)
print(res.text)

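It seems to work with or without the cookie; the page still prints after I remove it.
But I'm still puzzled about what lines 8 and 9 of the code in my question mean: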

content=response.content.decode('utf-8')
soup=BeautifulSoup(content,'lxml')

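Those are the two lines.
Does line 8 convert the format to utf-8? Is there any difference between response.content.decode('utf-8') and response.encoding('utf-8')?
And what does line 9 mean?
Most importantly, it raises an error.
I copied this code from a tutorial video by a Bilibili uploader, dated 2020-4-29.
I suspect bs4 may have been updated since then, but I can't really follow the official documentation.
I have a lot of questions, so please bear with me.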
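
On the difference asked about above, a hedged sketch: in `requests`, `response.content` is the raw body as bytes and `.decode(...)` turns those bytes into a string, raising if they do not fit the charset. `response.encoding` is a plain attribute (a charset name or None), not a method, so it is assigned to rather than called; `response.text` then decodes the body with that charset and substitutes a replacement character for bytes it cannot decode. The `FakeResponse` class below is a hypothetical stand-in that mimics only this behavior, and the GBK-encoded sample body is an assumption made for illustration.

```python
class FakeResponse:
    """Hypothetical stand-in that mimics how requests exposes a body."""

    def __init__(self, content, encoding):
        self.content = content      # raw bytes, like response.content
        self.encoding = encoding    # charset name, like response.encoding

    @property
    def text(self):
        # Decode lazily with the current charset; undecodable bytes are
        # replaced instead of raising an exception.
        return self.content.decode(self.encoding, errors="replace")


# Assumed sample: a body that is really GBK but is labelled utf-8.
resp = FakeResponse("<p>中文</p>".encode("gbk"), encoding="utf-8")

garbled = resp.text                    # wrong charset: mojibake, no exception
resp.encoding = "gbk"                  # attribute assignment, not a call
fixed = resp.text                      # decoded with the new charset
by_hand = resp.content.decode("gbk")   # explicit decode of the same bytes

print(garbled == fixed)   # False
print(fixed == by_hand)   # True
```

Line 9 of the original code is independent of all this: it only hands the already decoded string to BeautifulSoup for parsing.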

hello? · Posted on 2022-8-7 01:25:29

liuzhengyuan posted on 2022-8-6 23:40
Just add a cookie parameter

content=response.content.decode('utf-8')

I think I found the answer: utf-8 should be changed to gbk.
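
Rather than hard-coding one charset, one option is to try a short list of candidates in order. This is a minimal sketch, not the only approach: the helper name `decode_body` and the candidate list are assumptions made for illustration. UTF-8 is tried first because GBK accepts many byte sequences and can "succeed" on UTF-8 data while producing the wrong characters.

```python
def decode_body(raw, encodings=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and mark the bad bytes.
    return raw.decode(encodings[0], errors="replace"), encodings[0]


sample = "中文".encode("gbk")   # these bytes are not valid UTF-8
text, used = decode_body(sample)
print(used)   # gbk
```

With a real response this would be called as `decode_body(response.content)`. `requests` also offers `response.apparent_encoding`, a guess based on the body bytes, which can be assigned to `response.encoding` before reading `response.text`.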

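liuzhengyuan · Posted on 2022-8-7 01:41:07

This best answer was given by liuzhengyuan. Thanks to liuzhengyuan for the answer.

hello? posted on 2022-8-7 00:16
That doesn't seem right.

It seems to work with or without the cookie; the page still prints after I remove it.

I'm not sure about line 8... web pages are usually encoded in utf8.

Line 9 parses the HTML with the lxml parser.
You still need to add the cookie, otherwise Weibo's anti-scraping measures kick in. Along the way you can print the type of soup to see what soup is (it is a bs4 class).
After that it works normally: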

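import requests
from bs4 import BeautifulSoup

url = "https://s.weibo.com/top/summary?cate=realtimehot"
# Placeholder: paste the Cookie header value copied from your own browser session.
cookie = "<Cookie header value from your browser>"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 Edg/103.0.1264.77",
    "cookie": cookie,
}
response = requests.get(url, headers=headers)
content = response.text  # body decoded to str using response.encoding
soup = BeautifulSoup(content, 'lxml')  # parse the HTML with the lxml parser
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the link text of every <td class="td-02"> cell.
for i in soup.find_all("td", class_="td-02"):
    print(i.a.text)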
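To check the extraction step offline, here is a rough sketch that picks out the same cells from a small hand-written table. It deliberately swaps BeautifulSoup for the standard library's `html.parser`, so it runs with no third-party packages; the class name `HotSearchParser` and the sample markup are hypothetical, and the real page structure may differ.

```python
from html.parser import HTMLParser


class HotSearchParser(HTMLParser):
    """Collect the text of <a> tags found inside <td class="td-02"> cells."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "td" and "td-02" in classes:
            self.in_cell = True
        elif tag == "a" and self.in_cell:
            self.in_link = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


# Hand-written sample markup, not real Weibo output.
sample = (
    '<table><tr><td class="td-01">1</td>'
    '<td class="td-02"><a href="#">Example topic A</a><span>123</span></td></tr>'
    '<tr><td class="td-02"><a href="#">Example topic B</a></td></tr></table>'
)
parser = HotSearchParser()
parser.feed(sample)
print(parser.titles)
```

In the BeautifulSoup version above, a cell without a link would make `i.a` None, so a check such as `if i.a is not None` before reading `.text` may be worth adding.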

liuzhengyuan · Posted on 2022-8-7 02:05:12

hello? posted on 2022-8-7 01:25
I think I found the answer: utf-8 should be changed to gbk.

That works!
