Posted on 2020-7-1 22:23:18
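The encoding you write the file with doesn't match the encoding of the page. Just change this line so that the file is written with the same encoding that was detected for the crawled page:
with open('url_%d.txt' % num, 'w', encoding=encoding) as file: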
import chardet
from urllib import request

with open('urls.txt', 'r') as urls:
    num = 0
    for line in urls:
        url = line.strip()  # drop the trailing newline left by the file read
        if not url:
            continue  # skip blank lines
        num += 1
        response = request.urlopen(url).read()
        # Guess the page encoding from the raw bytes
        encoding = chardet.detect(response)['encoding']
        # GBK is a superset of GB2312, so it can decode more characters
        encoding = 'GBK' if encoding == 'GB2312' else encoding
        response = response.decode(encoding)
        # Write the file with the same encoding as the page
        with open('url_%d.txt' % num, 'w', encoding=encoding) as file:
            file.write(response)
Some of the text may still come out garbled, though, since the detected encoding is only a guess.
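If the partial garbling (or an outright decode error) is a problem, one option is to decode more tolerantly. Below is a minimal sketch, not part of the original code: the helper name `decode_page` and the UTF-8 fallback are my own assumptions. It takes the raw bytes plus whatever `chardet.detect(...)['encoding']` returned, which can be `None`.

```python
def decode_page(raw, detected):
    """Decode raw page bytes, tolerating a wrong or missing detected encoding.

    `detected` is the value of chardet.detect(raw)['encoding'] (may be None).
    Returns (text, encoding_used). Hypothetical helper for illustration.
    """
    # Assumption: fall back to UTF-8 when detection returned nothing
    encoding = detected or 'utf-8'
    # GBK is a superset of GB2312, so prefer it for GB2312 pages
    if encoding.upper() == 'GB2312':
        encoding = 'GBK'
    # errors='replace' substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError
    text = raw.decode(encoding, errors='replace')
    return text, encoding


if __name__ == '__main__':
    text, encoding = decode_page('中文'.encode('gbk'), 'GB2312')
    print(encoding, text)
```

When writing the result back out, `open()` also accepts `errors='replace'`, which keeps the write from failing on any replacement characters that the target encoding cannot represent.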