Web crawler, Newbie Discussion Area, Newbie Training Camp, FishC Forum (鱼C论坛)

君子好逑 posted on 2020-8-18 22:33:22

Web crawler

import urllib.request

url = 'https://www.biduo.cc/biquge/39_39888/c13353637.html'

headers = {
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
    #'Host': 'www.kanmaoxian.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363'
}

res = urllib.request.Request(url=url,headers=headers)

response = urllib.request.urlopen(res)
print(response.read().decode("gbk",'ignore'))

Could someone take a look at this program? The page is GBK-encoded, so why does decoding it as GBK still not work?

Twilight6 posted on 2020-8-18 23:00:35

Just remove this entry from headers: 'Accept-Encoding': 'gzip, deflate, br'

Reference code:

import urllib.request

url = 'https://www.biduo.cc/biquge/39_39888/c13353637.html'

headers = {
    'Accept-Language': 'zh-CN',
    'Cache-Control': 'no-cache',
    'Connection': 'Keep-Alive',
    #'Host': 'www.kanmaoxian.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363'
}

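# Build the request without Accept-Encoding, so the body arrives uncompressed
res = urllib.request.Request(url=url, headers=headers)

response = urllib.request.urlopen(res)
# The page is GBK-encoded, so decode the raw bytes as GBK
print(response.read().decode("gbk"))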

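君子好逑 posted on 2020-8-18 23:22:16

Twilight6 posted on 2020-8-18 23:00
Just remove this entry from headers: 'Accept-Encoding': 'gzip, deflate, br'

Reference code:

Why does that one have to be removed? I copied the whole set of headers from the browser's Inspect Element panel.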

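°蓝鲤歌蓝 posted on 2020-8-18 23:26:43

君子好逑 posted on 2020-8-18 23:22
Why does that one have to be removed? I copied the whole set of headers from the browser's Inspect Element panel.

Because once you add that header, the site sends back GZIP-compressed data, and urllib does not decompress it for you. You have to decompress first and then decode to get the real content.
So there are three ways to solve your problem.
1. As the reply above says, remove that line.
2. Keep that line and change the code to the following: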
import gzip

content = gzip.decompress(response.read()).decode("gbk")
print(content)

3. Use requests (recommended).
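
To see why the original decode went wrong and what option 2 is doing, here is a minimal offline sketch using only the standard library. It fakes the server by gzip-compressing a GBK string, so no network is needed. The helper name decode_body, its parameters and the sample text are made up for illustration; it only handles gzip and zlib-wrapped deflate, and it raises for anything else (such as br, which would need a third-party package).

```python
import gzip
import zlib


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "gbk") -> str:
    """Decompress a response body according to its Content-Encoding, then decode it.

    Hypothetical helper for illustration; not part of urllib.
    """
    encoding = content_encoding.strip().lower()
    if encoding == "gzip":
        raw = gzip.decompress(raw)
    elif encoding == "deflate":
        # Assumes the zlib-wrapped form; some servers send raw deflate instead.
        raw = zlib.decompress(raw)
    elif encoding not in ("", "identity"):
        raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")
    return raw.decode(charset)


# Simulate what the server sends when the client advertises gzip support.
page = "第一章 你好，世界"
wire_bytes = gzip.compress(page.encode("gbk"))

# Decoding the compressed bytes directly (as in the original post) gives garbage.
garbled = wire_bytes.decode("gbk", "ignore")

# Decompressing first, then decoding, recovers the text.
restored = decode_body(wire_bytes, "gzip")

print(garbled == page)
print(restored == page)
```

With a real urllib response, the same helper could be called as decode_body(response.read(), response.headers.get("Content-Encoding", "")), so the code works whether or not the server chose to compress.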

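君子好逑 posted on 2020-8-18 23:43:11

°蓝鲤歌蓝 posted on 2020-8-18 23:26
Because once you add that header, the site sends back GZIP-compressed data, and you have to decompress first and then decode to get the real content. ...

Thanks for clearing that up!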
