- data = open('123.html', 'r', encoding='utf-8')
- print(data)
- for line in data:
-     print(line)
With the encoding set to utf-8, some HTML files read without problems, but reading other HTML files (for example 123.html) fails with the following error:
File "D:\python37\lib\codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 317: invalid continuation byte
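If I remove the utf-8 argument, that file reads without problems, but the HTML files that could be read before now fail with the following error: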
UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 178: illegal multibyte sequence
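Question: is there an encoding conversion approach, or any other method, that reads HTML files in different encodings without raising errors?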
Last edited by XiaoPaiShen on 2019-10-17 00:30
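- import chardet
- # Read the raw bytes so nothing is decoded before the encoding is known
- with open('123.html', 'rb') as file:
-     rawdata = file.read()
- # chardet guesses the encoding from the byte content
- result = chardet.detect(rawdata)
- charenc = result['encoding']
- print(charenc)
- print(rawdata.decode(charenc))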
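If installing chardet is not an option, a standard-library fallback might look like the sketch below. It assumes the files are either UTF-8 or a GBK-family encoding, which is what the two error messages in the question suggest but does not prove. The function name `read_text_with_fallback` and the candidate list are made up for illustration and are not part of any library.

```python
from pathlib import Path

# Assumed candidates: UTF-8 first, then GB18030 (which extends GBK).
# Adjust this tuple if your files may use other encodings.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    raw = Path(path).read_bytes()  # read bytes once, decode in memory
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    # No candidate fit: decode with the first one anyway and replace
    # undecodable bytes with U+FFFD instead of raising.
    return raw.decode(encodings[0], errors="replace"), encodings[0]
```

Calling `read_text_with_fallback('123.html')` returns the decoded text together with the encoding that worked. UTF-8 is tried first because its decoder is strict and rarely accepts text written in another encoding, whereas the GBK-family decoders accept many byte sequences and could silently produce garbled text if tried first.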