[Solved] "utf-8" decoding error when fetching the FishC site with Python

鬼少 · posted on 2024-11-16 22:59:38

When I run the following code (my account is not allowed to post URLs, so the link is broken up with spaces):
import urllib.request

resp = urllib.request.urlopen('https:// fishc . com. cn')
html = resp.read()
html = html.decode("utf-8")
# Decode the bytes into a str, i.e. recover the HTML with its Chinese text
print(html)

Running it produces this error:
Traceback (most recent call last):
  File "D:/program/python_code/14_1.py", line 5, in <module>
    html = html.decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd3 in position 252: invalid continuation byte

Could someone explain what causes this error? Thanks.

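Best answer

jackz007 · posted on 2024-11-16 23:27:22

Last edited by jackz007 on 2024-11-16 23:32

Change

html = html.decode("utf-8")

to:

html = html.decode("gbk")

The full script, which also saves the page to a file:

import urllib.request

resp = urllib.request.urlopen('https://fishc.com.cn')
html = resp.read()
html = html.decode('gbk')
# Write the file as GBK explicitly rather than relying on the platform default encoding
with open('fishc.html', 'w', encoding='gbk') as fp:
    fp.write(html)

Then open the newly generated 'fishc.html', located in the same directory as the script, in Chrome or Edge.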
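The answer hard-codes `gbk`. One way to find out which encoding a page claims for itself is to look for a charset declaration near the top of the raw bytes. The sketch below is only an illustration: `sniff_meta_charset` is a hypothetical helper, the regular expression is a rough heuristic rather than a real HTML parser, and the sample bytes are made up instead of being fetched from the site.

```python
import re

# Rough heuristic: find "charset=<name>" in the first bytes of the page
_META_CHARSET = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_meta_charset(raw: bytes, limit: int = 2048):
    """Return the charset named in the HTML head (lowercased), or None."""
    match = _META_CHARSET.search(raw[:limit])
    return match.group(1).decode('ascii').lower() if match else None

# Made-up sample standing in for the bytes returned by resp.read()
sample = ('<html><head><meta http-equiv="Content-Type" '
          'content="text/html; charset=gbk"></head>'
          '<body>鱼C论坛</body></html>').encode('gbk')

charset = sniff_meta_charset(sample)
text = sample.decode(charset or 'utf-8')  # fall back to utf-8 if nothing is declared
print(charset)
```

A declared charset can itself be wrong or imprecise, so it is a hint rather than a guarantee.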


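FishC · posted on 2024-11-16 23:03:06

The error occurs because you are decoding as `utf-8` a byte stream that is not valid `utf-8`. Most likely the page uses a different encoding; truncated data could also cause it.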
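The mismatch can be reproduced without any network access. This is a minimal sketch with made-up sample text: it encodes a string as GBK, the way a GBK page would arrive over the wire, and then tries to decode those bytes as UTF-8.

```python
# Sample text chosen for illustration only
text = '鱼C论坛 零基础入门学习Python'
raw = text.encode('gbk')  # the bytes a GBK-encoded page would send

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    print(exc)  # same kind of error as in the question

# Decoding with the encoding that produced the bytes recovers the text
print(raw.decode('gbk'))
```

The bytes are not damaged; they are simply being read with the wrong codec.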

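You can fetch the page more robustly in either of these ways:

1. **Detect the page encoding automatically** with the `chardet` library.

First install `chardet`:

pip install chardet

Then use it to detect the encoding:

import urllib.request
import chardet

resp = urllib.request.urlopen('https://fishc.com.cn')
html = resp.read()
# Detect the encoding
result = chardet.detect(html)
encoding = result['encoding']
# Decode using the detected encoding
if encoding is not None:
    html = html.decode(encoding)
    print(html)
else:
    print("Could not detect the encoding")

2. **Use the `requests` library**, which can guess the encoding for you:

First install `requests`:

pip install requests

Then use it to fetch the page:

import requests

response = requests.get('https://fishc.com.cn')
response.encoding = response.apparent_encoding  # use the encoding guessed from the content
html = response.text
print(html)

Both approaches guess the encoding from the content, so they are more flexible than a hard-coded `utf-8`, although a guess can still be wrong.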
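Besides guessing from the content, the server may state the charset in the `Content-Type` response header. With `urllib.request`, `resp.headers` is an `http.client.HTTPMessage`, a subclass of `email.message.Message`, so `get_content_charset()` is available on it. The sketch below builds a `Message` by hand so that it runs offline; `charset_from_headers` is a hypothetical helper, and whether the FishC server actually sends a charset in this header is not something the thread confirms.

```python
from email.message import Message

def charset_from_headers(headers, default='utf-8'):
    """Return the charset named in Content-Type, or `default` if none is given."""
    return headers.get_content_charset() or default

# Stand-in for resp.headers so the sketch runs without network access
fake_headers = Message()
fake_headers['Content-Type'] = 'text/html; charset=GBK'
print(charset_from_headers(fake_headers))

# A header without a charset parameter falls back to the default
no_charset = Message()
no_charset['Content-Type'] = 'text/html'
print(charset_from_headers(no_charset))
```

With a real response this would be called as `charset_from_headers(resp.headers)` before `resp.read().decode(...)`.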

This reply was generated by the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up.

鬼少 · posted on 2024-11-16 23:42:53

FishC posted on 2024-11-16 23:03
The error occurs because you are decoding as `utf-8` a byte stream that is not valid `utf-8`. Most likely the page ...

I tried the first approach (detecting the encoding) with the following code:
import urllib.request
import chardet  # encoding detection

resp = urllib.request.urlopen('https://fishc.com.cn')
html = resp.read()

# Detect the encoding
result = chardet.detect(html)
print(result)
encoding = result['encoding']

# Decode using the detected encoding
if encoding is not None:
    html = html.decode(encoding)
    print(html)
else:
    print('Could not detect the encoding')

It fails with this error:
Traceback (most recent call last):
  File "D:/program/python_code/14_1.py", line 14, in <module>
    html = html.decode(encoding)
UnicodeDecodeError: 'gb2312' codec can't decode byte 0xd8 in position 30191: illegal multibyte sequence
decoding with 'GB2312' codec failed
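A `UnicodeDecodeError` carries the position of the failure, so the offending bytes can be inspected directly. The following sketch uses the exception's standard `start`, `end`, `object` and `reason` attributes; `describe_decode_failure` is a hypothetical helper and the sample bytes are made up rather than taken from the real page.

```python
def describe_decode_failure(raw: bytes, encoding: str, context: int = 4):
    """Return (start, nearby_bytes, reason) for the first failure, or None."""
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Slice a few bytes around the failing position for inspection
        window = exc.object[max(exc.start - context, 0):exc.end + context]
        return exc.start, window, exc.reason
    return None

# Made-up sample: ASCII followed by one character that GBK has but GB2312 lacks
sample = b'abc' + '龍'.encode('gbk')
info = describe_decode_failure(sample, 'gb2312')
print(info)
print(describe_decode_failure(sample, 'gbk'))  # None: GBK decodes it fine
```

Applied to the page from the question, `describe_decode_failure(html, 'gb2312')` would show the bytes around position 30191.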

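FishC · posted on 2024-11-16 23:42:59

鬼少 posted on 2024-11-16 23:42
I tried the first approach (detecting the encoding) with the following code:
impo ...

Your code already reads the detected encoding correctly through `result['encoding']`. The failure comes from the detection result itself: `chardet` reported `GB2312`, but the page evidently contains byte sequences that the `gb2312` codec cannot decode. Since `gbk` is a superset of `gb2312`, decoding with `gbk` whenever `GB2312` is reported avoids the error. Adjusted code:

import urllib.request
import chardet  # encoding detection

resp = urllib.request.urlopen('https://fishc.com.cn')
html = resp.read()
# Detect the encoding
result = chardet.detect(html)
print(result)
encoding = result['encoding']  # name of the detected encoding, or None
# A GB2312 result may really mean GBK, so use the superset
if encoding is not None and encoding.lower() == 'gb2312':
    encoding = 'gbk'
# Decode using the (possibly adjusted) encoding
if encoding is not None:
    html = html.decode(encoding)
    print(html)
else:
    print('Could not detect the encoding')

Accessing `result['encoding']` was never the problem; the point is not to take a `GB2312` result literally.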


鬼少 · posted on 2024-11-16 23:52:54

jackz007 posted on 2024-11-16 23:27
Change to:

Thanks a lot! What is the difference between 'gbk' and 'gb2312'? Why does decoding with gb2312 still fail?

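FishC · posted on 2024-11-16 23:53:00

鬼少 posted on 2024-11-16 23:52
Thanks a lot! What is the difference between 'gbk' and 'gb2312'? ...

`gb2312` and `gbk` are both encodings for Chinese text, but they differ in two ways:

1. **Character coverage**:
- `GB2312` is the older character set. It covers 6,763 commonly used Chinese characters plus several hundred symbols, and targets Simplified Chinese.
- `GBK` extends `GB2312` to about 21,000 Chinese characters, covering both Simplified and Traditional forms, and adds more symbols.

2. **Compatibility**:
- Every `GB2312` character is also in `GBK`, with the same byte values, so `GBK` is backward compatible with `GB2312`. Text encoded as `GB2312` decodes correctly as `GBK`, but not necessarily the other way round.

As for why `gb2312` fails: the page probably uses `GBK` and contains characters that do not exist in `GB2312`.

If `gb2312` is detected and decoding still fails, try decoding with `gbk`:

html = html.decode('gbk')

Decoding with `gbk` is usually the safer choice, because it is a superset of `GB2312` and accepts everything `GB2312` does.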
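The subset relationship can be checked with Python's built-in codecs. In this small sketch, `can_encode` is a hypothetical helper and the two characters are chosen for illustration: '龙' is a Simplified form, '龍' is its Traditional counterpart.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for ch in ('龙', '龍'):
    print(ch, 'gb2312:', can_encode(ch, 'gb2312'), 'gbk:', can_encode(ch, 'gbk'))

# A character present in both sets gets identical bytes in both encodings,
# which is why GB2312 text can be decoded as GBK.
same_bytes = '龙'.encode('gb2312') == '龙'.encode('gbk')
print(same_bytes)
```

A page that is labelled or detected as GB2312 but contains even one character like '龍' can only be decoded with the larger set.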


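FishC · posted on 2024-11-16 23:53:51

鬼少 posted on 2024-11-16 23:53
I tried the first approach (detecting the encoding) with the following code:
impo ...

Your code needs to read the encoding name from the result of `chardet.detect()`, which it already does. The detection code, for reference:

import urllib.request
import chardet  # encoding detection

resp = urllib.request.urlopen('https://fishc.com.cn')
html = resp.read()
# Detect the encoding
result = chardet.detect(html)
print(result)
encoding = result['encoding']  # name of the detected encoding, or None
# Decode using the detected encoding
if encoding is not None:
    html = html.decode(encoding)
    print(html)
else:
    print('Could not detect the encoding')

In `result`, `encoding` is a key, so `result['encoding']` returns the name of the detected encoding as a string.

If the detected encoding is `gb2312` and decoding fails, decode with `gbk` instead. `gbk` is a superset of `gb2312` and can handle more characters:

html = html.decode('gbk')

This handles the page's encoding more reliably.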
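The advice in this thread can be folded into one helper. This is a sketch under stated assumptions, not a definitive implementation: `decode_detected` is a hypothetical function, its `detected` argument stands for whatever `chardet.detect(raw)['encoding']` returned (possibly `None`), and the rule "treat GB2312 as GBK" is the heuristic suggested above rather than anything `chardet` does itself.

```python
def decode_detected(raw: bytes, detected):
    """Decode `raw` using a detector's guess, preferring GBK over GB2312."""
    candidates = []
    if detected:
        name = detected.lower()
        # Assumption from this thread: a GB2312 guess often means GBK in practice
        candidates.append('gbk' if name == 'gb2312' else name)
    candidates.append('utf-8')
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    # Last resort: decode anyway, marking undecodable bytes with U+FFFD
    return raw.decode('utf-8', errors='replace'), 'utf-8'

# Made-up sample containing Traditional characters that GB2312 lacks
sample = '繁體: 龍'.encode('gbk')
text, used = decode_detected(sample, 'GB2312')
print(used, text)
```

The function returns the encoding it actually used together with the text, which makes it easy to log when the detector's guess had to be overridden.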

