Posted on 2019-1-7 14:29:02
This post was last edited by ba21 on 2019-1-7 at 15:15
Method 1: use the requests module, which decompresses gzip responses for you
- import requests
- url = 'https://dg.zu.fang.com/house/s31/'
- res = requests.get(url)
- content = res.text  # decoded text; in practice some of the content still comes out with the wrong encoding
>>> type(res.text)
<class 'str'>
>>> type(res.content)
<class 'bytes'>
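The difference matters because `res.text` is just `res.content` decoded with whichever codec requests settled on. Here is a minimal standard-library sketch of what goes wrong when that guess is off. The sample string and the choice of GBK as the page's real encoding are assumptions for illustration, not something taken from the site:

```python
# Hypothetical sample: pretend the server sent GBK bytes without declaring a charset.
raw = "东莞租房".encode("gbk")

# ISO-8859-1 maps every byte to a character, so the wrong decode never fails;
# it silently produces mojibake instead.
wrong = raw.decode("iso-8859-1")

# Decoding with the codec the bytes were actually written in recovers the text.
right = raw.decode("gbk")

print(repr(wrong))
print(right)
```

Because the wrong decode is lossless here, `wrong.encode("iso-8859-1")` gives back the original bytes, which is why working from the raw bytes and decoding them yourself is a safe way out.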
To use the content reliably, you still have to detect the encoding and decode the bytes yourself
- import requests
- import cchardet
- url = 'https://dg.zu.fang.com/house/s31/'
- res = requests.get(url)
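- content = res.content  # raw bytes
- # detect the encoding
- enc = cchardet.detect(content)
- enc = enc['encoding']
- print(enc)
- content = content.decode(enc)  # decode the bytes to str
- # write as UTF-8 explicitly instead of relying on the platform default
- with open('tttt.txt', "w", encoding='utf-8') as f:
-     f.write(content)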
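A detector can come back with no answer or with a codec name Python does not know, and then `content.decode(enc)` fails. One way to guard against that is sketched below with the standard library only. `decode_bytes` is a hypothetical helper, not part of requests or cchardet, and the fallback list (UTF-8, then GB18030) is an assumption that suits Chinese pages:

```python
def decode_bytes(raw, detected=None, fallbacks=("utf-8", "gb18030")):
    """Try the detected codec first, then each fallback; return (text, codec_used)."""
    candidates = [detected] if detected else []
    candidates.extend(fallbacks)
    for name in candidates:
        try:
            return raw.decode(name), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: bytes are not valid in this codec.
            # LookupError: the codec name is unknown to Python.
            continue
    # Last resort: keep going with replacement characters rather than crash.
    return raw.decode("utf-8", errors="replace"), "utf-8"


# Usage with a hypothetical GBK page body and no detector result.
text, used = decode_bytes("租房".encode("gbk"), detected=None)
print(text, used)
```

In the code above this would be called as `decode_bytes(content, enc)`, with `enc` being whatever the detector reported.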
Method 2: decompress the response in your own code
- import urllib.request
- import gzip
- import cchardet
- import io
- url = 'https://dg.zu.fang.com/house/s31/'
- req = urllib.request.Request(url)
- data = urllib.request.urlopen(req)
- encoding = data.getheader('Content-Encoding')
- content = data.read()
- # the server returns gzip-compressed data, so decompress it first
- if encoding == 'gzip':
-     buf = io.BytesIO(content)
-     gf = gzip.GzipFile(fileobj=buf)
-     content = gf.read()
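The decompression step can be tried without touching the network. The sketch below is a variation, not the code from this post: it uses `gzip.decompress` instead of `GzipFile`, and it also checks the two gzip magic bytes in case the header is missing. `maybe_gunzip` and the sample page are hypothetical names made up for the example:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body, content_encoding=None):
    """Return body decompressed if it looks like gzip data, otherwise unchanged.

    Raises OSError if the header claims gzip but the body is not gzip data.
    """
    declared = (content_encoding or "").strip().lower() == "gzip"
    if declared or body.startswith(GZIP_MAGIC):
        return gzip.decompress(body)
    return body


# Round trip on a hypothetical page body.
page = "<html>东莞租房</html>".encode("utf-8")
packed = gzip.compress(page)

print(maybe_gunzip(packed, "gzip") == page)
print(maybe_gunzip(page) == page)
```

With the code above, the call would be `content = maybe_gunzip(content, encoding)`.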
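- # Option 1:
- # use cchardet to detect the encoding so the bytes can be decoded
- enc = cchardet.detect(content)
- enc = enc['encoding']
- print(enc)
- content = content.decode(enc)  # decode the bytes to str
- # write as UTF-8 explicitly instead of relying on the platform default
- with open('tttt.txt', "w", encoding='utf-8') as f:
-     f.write(content)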
- # Option 2:
- # write the raw bytes straight to the file
- with open('tttt.txt', "wb") as fb:
-     fb.write(content)
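The two options do not leave the same bytes on disk. A small sketch of the difference, assuming a hypothetical GBK-encoded body and temporary file names chosen just for the example: binary mode keeps the page in its original encoding, while text mode re-encodes it in whatever `encoding` you pass to `open`.

```python
import os
import tempfile

raw = "东莞租房".encode("gbk")  # hypothetical GBK-encoded page body

with tempfile.TemporaryDirectory() as tmp:
    bin_path = os.path.join(tmp, "raw.txt")
    txt_path = os.path.join(tmp, "text.txt")

    # Option 2 style: bytes go to disk untouched.
    with open(bin_path, "wb") as fb:
        fb.write(raw)

    # Option 1 style: decode first, then let open() re-encode as UTF-8.
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(raw.decode("gbk"))

    # Read both files back as bytes to compare what was actually stored.
    with open(bin_path, "rb") as fb:
        bin_bytes = fb.read()
    with open(txt_path, "rb") as fb:
        txt_bytes = fb.read()

print(bin_bytes == raw)  # binary file matches the original bytes
print(txt_bytes == raw)  # text file was re-encoded, so it differs
```

So the binary file has to be opened later with the page's own encoding, whereas the text file can always be opened as UTF-8.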