[Solved] One site raises a decode error after fetching, whether I use GBK or UTF-8

a1104201 · posted 2016-3-19 17:45:52


Site URL:
http://mm.xmeise.com/xingge/shunv/2634.html
Here is the code I wrote:
#-*- coding:UTF-8 -*-;
import urllib.request;
import urllib.parse;
import json;
import os;
import urllib.error;
import http.client;

def webHttp(url, dataType=False, charset="UTF-8"):
    req = urllib.request.Request(url);
    req.add_header("User-Agent","Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36");

    response = urllib.request.urlopen(req);
    data = response.read();
    if dataType == False:
        try:
            html = data.decode(charset);
        except UnicodeDecodeError:
            html = data.decode("GBK");
    else:
        html = data;

    return html;

url = "http://mm.xmeise.com/xingge/shunv/2634.html";
html = webHttp(url);
print(html);

Whether I decode with GBK, GB2312, or UTF-8, I get the following error:
Traceback (most recent call last):
  File "E:/Python/抓美女/抓图片.py", line 17, in webHttp
    html = data.decode(charset);
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/Python/抓美女/抓图片.py", line 26, in <module>
    html = webHttp(url);
  File "E:/Python/抓美女/抓图片.py", line 19, in webHttp
    html = data.decode("GBK");
UnicodeDecodeError: 'gbk' codec can't decode byte 0x8b in position 1: illegal multibyte sequence

However, this program works correctly on other sites; only this one site has the problem: http://mm.xmeise.com/xingge/shunv/2634.html


hldh214 · posted 2016-3-19 17:56:58

This best answer was given by hldh214; thanks to hldh214 for the answer.


In my testing, decoding with gbk works:

import requests
resp = requests.get('http://mm.xmeise.com/xingge/shunv/2634.html')
# resp.content is the public accessor for the raw response bytes
print(resp.content.decode('gbk'))


a1104201 · posted 2016-3-23 23:55:05

hldh214 posted on 2016-3-19 17:56
In my testing, decoding with gbk works,

Your code does work, but why doesn't the code I posted above work? Please explain.

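hldh214 · posted 2016-3-24 10:38:08

a1104201 posted on 2016-3-23 23:55
Your code does work, but why doesn't the code I posted above work? Please explain.

This site forces gzip compression on. With urllib, you just need to decompress the data with the gzip module first: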

import gzip
data = gzip.decompress(data)
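
Putting this together with the original function, a minimal sketch might look like the following. The names `decode_body` and `fetch_html` are made up for illustration and are not from the thread. The sketch assumes the body is gzip whenever it starts with the bytes 0x1f 0x8b (the standard gzip magic number, which also matches the "byte 0x8b in position 1" in the traceback), and it simply tries UTF-8 first and then GBK.

```python
import gzip
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_body(data, charsets=("utf-8", "gbk")):
    """Decompress the body if it looks like gzip, then try each charset in turn."""
    if data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
    for charset in charsets:
        try:
            return data.decode(charset)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the charsets %r could decode the body" % (charsets,))


def fetch_html(url, charsets=("utf-8", "gbk")):
    """Hypothetical stand-in for webHttp: fetch a page and return it as text."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as response:
        return decode_body(response.read(), charsets)
```

With this, `fetch_html("http://mm.xmeise.com/xingge/shunv/2634.html")` would take the place of the original `webHttp(url)` call. Checking the magic bytes rather than the Content-Encoding response header is just one possible choice; it keeps `decode_body` usable on raw bytes alone.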

