[Solved] Lesson 053: checking a website's encoding

大忽悠喵 · Posted 2019-3-13 19:35:06

This code detects the encoding of the page at a given URL:
import urllib.request
import chardet

def main():
    url = input("Enter a URL: ")

    response = urllib.request.urlopen(url)
    html = response.read()

    # Detect the page encoding
    encode = chardet.detect(html)['encoding']
    # GBK is a superset of GB2312, so report GBK instead
    if encode == 'GB2312':
        encode = 'GBK'

    print("This page uses the encoding: %s" % encode)

if __name__ == "__main__":
    main()

Running the program gives:
Enter a URL: https://fishc.com.cn
This page uses the encoding: Windows-1254

But the page source contains <meta http-equiv="Content-Type" content="text/html; charset=gbk" />, so the encoding should be gbk. Why does the program report Windows-1254?
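The declared charset can also be read from the downloaded bytes instead of by viewing the source in a browser. The following is only a minimal sketch, assuming the declaration sits in a `<meta>` tag near the top of the page; `META_CHARSET` and `declared_charset` are names made up for this illustration, and the sample bytes stand in for a real response body.

```python
import re

# Matches charset=... inside a <meta> tag, with or without quotes, e.g.
# <meta http-equiv="Content-Type" content="text/html; charset=gbk" />
META_CHARSET = re.compile(rb'<meta[^>]*?charset=["\']?([\w-]+)', re.IGNORECASE)

def declared_charset(html):
    """Return the charset named in the first matching <meta> tag, or None."""
    # Assumption: the declaration appears within the first 4096 bytes.
    match = META_CHARSET.search(html[:4096])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()

# Sample bytes standing in for response.read()
sample = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk" />'
print(declared_charset(sample))
```

With a live response from urllib.request.urlopen, response.headers.get_content_charset() returns the charset from the HTTP Content-Type header when the server sends one, which gives a second declared value to compare against.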

Best answer

ba21

2019-3-13 19:52:30

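The charset declared in the page's source header is not necessarily correct; what matters is the bytes the server actually sends, and they do not have to match that declaration.

So go by the detected encoding. chardet's detection can also be far off, so cchardet is recommended instead.
Code: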

import urllib.request
import cchardet

def main():
    url = input("Enter a URL: ")
    response = urllib.request.urlopen(url)
    html = response.read()
    # Detect the page encoding
    enc = cchardet.detect(html)
    enc = enc['encoding']
    print("This page uses the encoding: %s" % enc)

if __name__ == "__main__":
    main()
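Since a detector only returns a statistical guess, one way to sanity-check a candidate encoding is to decode the bytes with it in strict mode. The sketch below is not part of the answer above; `decodes_cleanly` and `pick_encoding` are hypothetical helper names, and the sample bytes are made up for the illustration.

```python
def decodes_cleanly(data, encoding):
    """Return True if data decodes with the given encoding without errors."""
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # UnicodeDecodeError: invalid byte sequence for this encoding.
        # LookupError: Python does not know the encoding name.
        return False
    return True

def pick_encoding(data, candidates):
    """Return the first candidate that decodes data cleanly, else None."""
    for encoding in candidates:
        if encoding and decodes_cleanly(data, encoding):
            return encoding
    return None

# Made-up sample: a GBK-encoded fragment standing in for response.read()
sample = "<title>中文网页</title>".encode("gbk")

# Candidates are tried in order.
print(pick_encoding(sample, ["utf-8", "gbk", "windows-1254"]))
```

A clean decode is necessary but not sufficient: single-byte encodings such as Windows-1254 accept most byte sequences, so they should be tried last, after the declared charset and any multi-byte candidates.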
