[Solved] Help with a problem from Lesson 53

pythonDemo · Posted on 2017-8-9 21:25:39


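For the last question of the Lesson 53 homework: why does the text come out wrong after I decode the pages from hao123, youku, 163 and so on, while the FishC official site decodes correctly?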

111

Best answer


shinemic

2017-8-10 10:27:46

Last edited by shinemic on 2017-8-10 10:34

First, let's reproduce the problem. For convenience, I wrote a function that writes the decoded string straight to a file:

import urllib.request
fishC = 'http://www.fishC.com'
hao123 = 'http://www.hao123.com'
youku = 'http://www.youku.com'
netbase = 'http://www.163.com'
def write_html_to_file(url, filename, decode='utf-8'):
    # Fetch the raw bytes of the page
    response = urllib.request.urlopen(url)
    html = response.read()
    # Decode the bytes with the given codec (UTF-8 by default)
    html = html.decode(decode)
    # Write the decoded text to <filename>.html
    fstream = open(filename + '.html', 'w')
    fstream.write(html)
    fstream.close()


Run the test:

write_html_to_file(fishC, 'fishC')
write_html_to_file(hao123, 'hao123')
write_html_to_file(youku, 'youku')
write_html_to_file(netbase, 'netbase')


In this test fishC, hao123 and youku all work without problems, but NetEase (163.com) raises an error:

---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
<ipython-input-17-7b03f968a2d0> in <module>()
11 fstream.close
12
---> 13 write_html_to_file(netbase, 'netbase')
<ipython-input-17-7b03f968a2d0> in write_html_to_file(url, filename, decode)
6 response = urllib.request.urlopen(url)
7 html = response.read()
----> 8 html = html.decode(decode)
9 fstream = open(filename + '.html', 'w')
10 fstream.write(html)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 565: invalid continuation byte
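
The same failure can be reproduced offline, without fetching anything. This is only a minimal sketch, assuming the page bytes are GBK-encoded Chinese text; the sample string is borrowed from the page's description tag, and the names `sample`, `raw`, `failed` and `restored` are illustrative, not part of the original code.

```python
# Illustrative sample: Chinese text as it might appear in a GBK-encoded page.
sample = '网易是中国领先的互联网技术公司'
raw = sample.encode('gbk')      # bytes roughly as a GBK page would deliver them

try:
    raw.decode('utf-8')         # wrong codec, the same mistake as in the traceback
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    print('decode failed:', exc.reason)

# Decoding with the codec that produced the bytes restores the text.
restored = raw.decode('gbk')
print(restored == sample)
```

The point is that `decode` has to be given the codec the bytes were actually written in; guessing UTF-8 only works for pages that really are UTF-8.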


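After some searching on Baidu and Google, I found that this is a page encoding problem. Open the NetEase page, press F12 to open the developer console (Chrome), and you will find this line: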

<meta http-equiv="Content-Type" content="text/html; charset=gbk">


This means the page is encoded in gbk. That is why the function I wrote takes a decode parameter, so the codec can easily be changed later:

write_html_to_file(netbase, 'netbase', 'gbk')
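
Instead of looking up the charset by hand for every site, the codec could be guessed from the page itself. The following is a rough sketch rather than a robust detector: `guess_charset`, `META_CHARSET` and the sample `page` bytes are hypothetical names made up for illustration, and it assumes the declaration appears within the first few kilobytes of the page. With urllib, the header value could be read via `response.headers.get_content_charset()`.

```python
import re

# Matches charset=... inside a <meta> tag, e.g. charset=gbk or charset="utf-8".
META_CHARSET = re.compile(rb'<meta[^>]*charset=["\']?([\w-]+)', re.IGNORECASE)

def guess_charset(raw, header_charset=None, default='utf-8'):
    """Return an encoding name to try when decoding raw page bytes.

    header_charset is the charset from the HTTP Content-Type header, if the
    server sent one; otherwise the <meta> declaration is searched for.
    """
    if header_charset:
        return header_charset
    match = META_CHARSET.search(raw[:4096])   # only scan the top of the page
    if match:
        return match.group(1).decode('ascii')
    return default

# Sample bytes modelled on the meta tag quoted above.
page = b'<head><meta http-equiv="Content-Type" content="text/html; charset=gbk"></head>'
print(guess_charset(page))
```

The result could then be passed as the third argument, for example `write_html_to_file(netbase, 'netbase', charset)`, though the function above would first need the raw bytes to inspect.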


With that change the problem is gone (excerpt from netbase.html):

<!DOCTYPE HTML>
<html phone="1" id="ne_wrap">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gbk">
<meta name="model_url" content="http://www.163.com/special/0077rt/index.html" />
<title>网易</title>
<base target="_blank" />
<meta name="Keywords" content="网易,邮箱,游戏,新闻,体育,娱乐,女性,亚运,论坛,短信,数码,汽车,手机,财经,科技,相册" />
<meta name="Description" content="网易是中国领先的互联网技术公司，为用户提供免费邮箱、游戏、搜索引擎服务，开设新闻、娱乐、体育等30多个内容频道，及博客、视频、论坛等互动交流，网聚人的力量。" />


However, if you open this file directly by double-clicking it, the text shows up garbled.

Of course, this is a browser / front-end issue and outside the scope of a Python discussion. The fix is to change the page encoding to UTF-8.
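
One plausible cause of the garbling, which the thread does not confirm, is that the file is written in the platform's default text encoding while its meta tag still declares gbk. Under that assumption, a small sketch of the fix is to write the file explicitly as UTF-8 and make the declaration agree. `save_as_utf8` and `CHARSET_DECL` are hypothetical helpers, and the sample text is a shortened stand-in for the real page.

```python
import os
import re
import tempfile

# Matches the charset value in a declaration such as charset=gbk.
CHARSET_DECL = re.compile(r'(charset=["\']?)[\w-]+', re.IGNORECASE)

def save_as_utf8(html_text, path):
    """Write decoded HTML as UTF-8 and rewrite its first charset declaration."""
    fixed = CHARSET_DECL.sub(r'\g<1>utf-8', html_text, count=1)
    with open(path, 'w', encoding='utf-8') as fstream:
        fstream.write(fixed)
    return fixed

# Shortened stand-in for the decoded NetEase page.
text = ('<meta http-equiv="Content-Type" content="text/html; charset=gbk">'
        '<title>网易</title>')
path = os.path.join(tempfile.mkdtemp(), 'netbase.html')
fixed = save_as_utf8(text, path)
print(fixed)
```

Passing `encoding='utf-8'` to `open` also removes the dependence on the platform default, so the saved file is the same on every system.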


pythonDemo · Posted on 2017-8-9 23:18:18

The error shown in the terminal is: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 565: invalid continuation byte
