UTF-8 decoding fails in a small crawler, Python Discussion, Programming Languages section, 鱼C论坛

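山沟流水 posted on 2017-12-6 08:46:49

UTF-8 decoding fails in a small crawler

Even with code identical to 小甲鱼's, I get a decoding error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 251: invalid start byte
My other simple crawlers hit similar problems. Searching online, I read that the encoding is wrong and the data has to be converted to the matching format, but the error persists even after I set the encoding to 'UTF-8'. Could anyone explain what is going on? The code is below: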

import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
    page = urllib.request.urlopen(req)
    html = page.read().decode('utf-8')
    #print(html)
    return html

def get_ip(html):
    #p = r'(?:(?:[01]?\d?\d|2[0-4]\d|25[0-5])\.){3}(?:[01]?\d?\d|2[0-4]\d|25[0-5])'
    p = r'<img src="([^"]+\.jpg)'
    imglist = re.findall(p, html)
    for each in imglist:
        print(each)

if __name__ == '__main__':
    #url = 'http://www.kuaidaili.com/free'

    url = 'http://www.lanrentuku.com/vector/design'
    open_url(url)
    #get_ip(open_url(url))

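°蓝鲤歌蓝 posted on 2017-12-6 10:13:27

I have run into this before. The page you are requesting may be sent to you in compressed form, so you need to decompress it before decoding; if you don't, you get this error.
You should still post the full error output, though, so we can see exactly where it fails.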
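
One way to act on this advice is to look at the response's Content-Encoding header and gunzip only when the server says the body is gzip-compressed. The sketch below is an illustration under assumptions, not code from the course: the helper name `decompress_if_needed` is hypothetical, and it takes the raw bytes and the header value as plain arguments so it can be tried without a network connection. With urllib, those two values would come from `page.read()` and `page.headers.get('Content-Encoding')`.

```python
import gzip


def decompress_if_needed(raw, content_encoding):
    """Return the body bytes, gunzipped only if the header says gzip.

    raw: bytes as returned by page.read()
    content_encoding: value of the Content-Encoding header, or None
    (hypothetical helper, not part of urllib)
    """
    if content_encoding and content_encoding.strip().lower() == 'gzip':
        return gzip.decompress(raw)
    return raw


# Simulated responses: one gzip-compressed body, one plain body.
page_bytes = '<html>你好</html>'.encode('gbk')
compressed = gzip.compress(page_bytes)

print(decompress_if_needed(compressed, 'gzip') == page_bytes)
print(decompress_if_needed(page_bytes, None) == page_bytes)
```

As far as I know, urllib.request does not ask for compressed responses by itself, so a server may send it an uncompressed body even when a browser receives gzip for the same URL.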

山沟流水 posted on 2017-12-6 22:09:58

°蓝鲤歌蓝 posted on 2017-12-6 10:13
I have run into this before. The page you are requesting may be sent to you in compressed form, so you need to decompress it before decoding; if you don't, you get ...

Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\自动获取IP.py", line 23, in <module>
    open_url(url)
  File "C:\Users\Administrator\Desktop\自动获取IP.py", line 8, in open_url
    html= page.read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 251: invalid start byte
This is the error output. Other code I wrote earlier had the same problem.

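山沟流水 posted on 2017-12-6 22:17:36

°蓝鲤歌蓝 posted on 2017-12-6 10:13
I have run into this before. The page you are requesting may be sent to you in compressed form, so you need to decompress it before decoding; if you don't, you get ...

If compression really is the cause, how do I decompress the data? I have only been learning for a short while and there is a lot I don't know yet, so please bear with me.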

chakyam posted on 2017-12-6 22:48:00

html= page.read().decode('gbk')
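
Why this one-line change can help: in GBK, Chinese characters are two-byte sequences, and those bytes are often not valid UTF-8. A small illustration, using the character 国 as an arbitrary example (it is not taken from the page in question): its GBK encoding starts with 0xb9, the same byte value reported in the error message.

```python
text = '国'
raw = text.encode('gbk')
print(raw)  # b'\xb9\xfa'

# 0xb9 is a UTF-8 continuation byte, so it cannot start a character.
try:
    raw.decode('utf-8')
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc
    print(exc.reason)  # invalid start byte

# Decoding with the encoding the bytes were written in round-trips cleanly.
print(raw.decode('gbk') == text)
```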

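°蓝鲤歌蓝 posted on 2017-12-6 23:22:02

山沟流水 posted on 2017-12-6 22:17
If compression really is the cause, how do I decompress the data? I have only been learning for a short while and there is a lot I don't know yet, so please bear with me.

https://zhuanlan.zhihu.com/p/25095566
It is explained there. Read it carefully and you should be able to work it out.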

山沟流水 posted on 2017-12-7 07:44:13

°蓝鲤歌蓝 posted on 2017-12-6 23:22
https://zhuanlan.zhihu.com/p/25095566
It is explained there. Read it carefully and you should be able to work it out.

The article at the link describes exactly my situation, but when I followed the first method it gives, I got:
Traceback (most recent call last):
  File "C:\Users\Administrator\Desktop\自动获取IP.py", line 25, in <module>
    open_url(url)
  File "C:\Users\Administrator\Desktop\自动获取IP.py", line 10, in open_url
    html=gzip.decompress(html).decode('utf-8')
  File "E:\python37\lib\gzip.py", line 532, in decompress
    return f.read()
  File "E:\python37\lib\gzip.py", line 276, in read
    return self._buffer.read(size)
  File "E:\python37\lib\gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "E:\python37\lib\gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'<!')
So now it says the data is not a gzip-compressed file after all. What is going on?
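
The message `Not a gzipped file (b'<!')` shows the first two bytes that gzip saw: `<!`, which looks like the start of an HTML document rather than the gzip magic number 0x1f 0x8b. So in this run the body appears to have arrived uncompressed. A defensive sketch (the helper name `gunzip_if_gzipped` is made up for illustration) checks the magic bytes before calling `gzip.decompress`:

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def gunzip_if_gzipped(raw):
    """Decompress raw only when it starts with the gzip magic number."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


plain = b'<!DOCTYPE html><html></html>'  # made-up sample body
packed = gzip.compress(plain)

print(gunzip_if_gzipped(plain)[:2])        # plain input is returned untouched
print(gunzip_if_gzipped(packed) == plain)  # gzip input is decompressed
```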

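°蓝鲤歌蓝 posted on 2017-12-7 10:23:24

I just visited your URL, and the site does return gzip-format data.

山沟流水 posted on 2017-12-7 11:42:28

°蓝鲤歌蓝 posted on 2017-12-7 10:23
I just visited your URL, and the site does return gzip-format data.

Right, what I found also says gzip, so I guess the method in that link only works under certain conditions. Anyway, thanks for the help.

°蓝鲤歌蓝 posted on 2017-12-7 11:53:54

山沟流水 posted on 2017-12-7 11:42
Right, what I found also says gzip, so I guess the method in that link only works under certain conditions. Anyway, thanks for the help.

Could you post your modified code so I can take a look?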

山沟流水 posted on 2017-12-7 13:33:41

°蓝鲤歌蓝 posted on 2017-12-7 11:53
Could you post your modified code so I can take a look?

def open_url(url):

    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')

    page = urllib.request.urlopen(req)
    html = page.read()
    html = gzip.decompress(html).decode('utf-8')
    print(html)
I only changed this function; nothing else was touched.

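°蓝鲤歌蓝 posted on 2017-12-7 13:58:36

Solved.

山沟流水 posted on 2017-12-7 14:49:00

°蓝鲤歌蓝 posted on 2017-12-7 13:58
Solved.

I thought the version I had installed was the problem, so I just uninstalled it and installed 3.6 instead, but I still get the same error. The problem must be somewhere on my computer.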

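°蓝鲤歌蓝 posted on 2017-12-7 15:09:51

山沟流水 posted on 2017-12-7 14:49
I thought the version I had installed was the problem, so I just uninstalled it and installed 3.6 instead, but I still get the same error. The problem must be somewhere on my ...

It is not your computer. The page is encoded in GBK, and it is sent to you compressed, so decompress it first and then call decode('GBK').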
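
If the page's encoding is not known in advance, one common workaround is to try a short list of candidate encodings in order. This is only a sketch: the function name `decode_with_fallback` and the candidate list are assumptions, and guessing like this can pick a wrong encoding that happens to decode without error, so a charset declared by the server or by the page itself should take priority when one is available.

```python
def decode_with_fallback(raw, encodings=('utf-8', 'gbk')):
    """Try each encoding in turn; return (text, encoding_that_worked)."""
    last_error = None
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError as exc:
            last_error = exc
    if last_error is None:
        raise ValueError('no encodings given')
    raise last_error


# Made-up sample data: the same text encoded two different ways.
gbk_bytes = '中国'.encode('gbk')
utf8_bytes = '中国'.encode('utf-8')

print(decode_with_fallback(gbk_bytes)[1])   # falls through to gbk
print(decode_with_fallback(utf8_bytes)[1])  # utf-8 succeeds first
```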

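山沟流水 posted on 2017-12-7 15:23:56

°蓝鲤歌蓝 posted on 2017-12-7 15:09
It is not your computer. The page is encoded in GBK, and it is sent to you compressed, so decompress it first and then call decode('GBK').

I tried GBK and got the same error.
I have two questions. First, you just printed the HTML successfully with my code, which suggests the code itself is fine (if you changed anything, please tell me what). Second, the error my code raises is an OSError. Can an encoding mismatch cause that error too?

山沟流水 posted on 2017-12-7 15:34:22

chakyam posted on 2017-12-6 22:48
html= page.read().decode('gbk')

Still the same error message.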

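°蓝鲤歌蓝 posted on 2017-12-7 16:38:18

山沟流水 posted on 2017-12-7 15:34
Still the same error message.

I did not change anything, I just did not wrap it in a function. Are you decompressing before reading?
I wanted to post a screenshot, but I have used up today's image uploads.
The code is below. I just ran it again and it still works; the page is read without problems.

import urllib.request

url = 'http://www.lanrentuku.com/vector/design'
req = urllib.request.Request(url)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
page = urllib.request.urlopen(url)  # note: url is passed here, so req and its User-Agent header are not used
html = page.read().decode('gbk')
#html=gzip.decompress(html).decode('utf-8')
print(html)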
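
Instead of hard-coding 'gbk', the encoding can often be read from the response. With urllib, `page.headers.get_content_charset()` returns the charset from the Content-Type header when the server sends one; when it does not, the HTML usually declares its encoding in a meta tag. The helper below, `sniff_charset`, is a hypothetical name and a rough sketch that only searches for `charset=` near the top of the document. The sample bytes are made up and do not come from the site discussed here.

```python
import re

# Matches charset=gb2312, charset="utf-8", charset='gbk', and similar.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9_\-]+)', re.IGNORECASE)


def sniff_charset(raw, default='utf-8', window=2048):
    """Return the charset declared in the first `window` bytes, else default."""
    match = _CHARSET_RE.search(raw[:window])
    if match:
        return match.group(1).decode('ascii').lower()
    return default


sample = (b'<html><head>'
          b'<meta http-equiv="Content-Type" content="text/html; charset=gb2312">'
          b'</head><body>' + '中国'.encode('gbk') + b'</body></html>')

charset = sniff_charset(sample)
html = sample.decode(charset)
print(charset)
```

A declared charset can still be wrong or missing, so wrapping the final `decode` in a try/except remains a reasonable precaution.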

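神棍呐 posted on 2017-12-23 21:15:54

chakyam posted on 2017-12-6 22:48
html= page.read().decode('gbk')

Your answer solved my problem.

vction82 posted on 2021-2-25 09:32:47

神棍呐 posted on 2017-12-23 21:15
Your answer solved my problem.

Could you tell me how you solved it? I have run into a similar problem. I am reading a gzip file (a .tar.gz file) downloaded from a Linux FTP server, and an error is raised when reading reaches the last line of the file.

The code is as follows: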
for temp_name in lst_path:
    lst_path1 = tar_path + temp_name
    oldFileName = gzip.open(lst_path1, 'rb')    # read the tar.gz file directly, without extracting it
    #oldFileName = open(lst_path1, 'r+', encoding='utf-8')  # read the MML result file as txt
    Textline = " "  # the original initial value was lost; any non-empty string lets the loop start

    while not Textline == "":
        print(Textline)
        Textline = bytes.decode(oldFileName.readline(), encoding="utf-8")  # read the tar.gz directly and decode it; this is the failing line

        #print(Textline)
        Textline1 = ''.join(Textline).strip()
        print(Textline1)
        if Textline1 == "" or ("命令-----" not in Textline1):
            continue
        FUNC = LST_DICT
        FUNC(Textline1, Textline.replace(" ", "_").replace(":;\r\n", "").lower())

The error is as follows:
Traceback (most recent call last):
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.2.3\helpers\pydev\pydevd.py", line 1580, in <module>
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.2.3\helpers\pydev\pydevd.py", line 964, in run
    pydev_imports.execfile(file, globals, locals)# execute the script
  File "C:\Program Files (x86)\JetBrains\PyCharm Community Edition 2016.2.3\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Users/zhangweili/PycharmProjects/LSTNR/datafx_js_v2.py", line 1438, in <module>
    Textline = bytes.decode(oldFileName.readline(),encoding="utf-8") # read the tar.gz directly and decode it
  File "C:\Users\zhangweili\AppData\Local\Programs\Python\Python37\lib\gzip.py", line 374, in readline
    return self._buffer.readline(size)
  File "C:\Users\zhangweili\AppData\Local\Programs\Python\Python37\lib\_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "C:\Users\zhangweili\AppData\Local\Programs\Python\Python37\lib\gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "C:\Users\zhangweili\AppData\Local\Programs\Python\Python37\lib\gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x99\x12')
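
A `.tar.gz` file is a tar archive wrapped in gzip, so reading it with `gzip.open` yields the raw tar stream (member headers and padding included), not just the text of the files inside. Whether that is what triggers the OSError here cannot be confirmed from the traceback alone. The bytes `b'\x99\x12'` only show that gzip found something other than a gzip header where it expected one, which might mean trailing data after the compressed stream or a damaged download. As a sketch of the standard-library alternative, `tarfile` can iterate over the archive members and hand back each file's bytes. The archive below is built in memory with a made-up member name and contents so the example is self-contained; `iter_text_lines` is a hypothetical helper name.

```python
import io
import tarfile


def iter_text_lines(fileobj, encoding='utf-8'):
    """Yield (member_name, line) for every regular file in a .tar.gz stream."""
    with tarfile.open(fileobj=fileobj, mode='r:gz') as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            extracted = archive.extractfile(member)
            for raw_line in extracted:
                yield member.name, raw_line.decode(encoding).rstrip('\r\n')


# Build a small .tar.gz in memory (hypothetical file name and contents).
payload = 'first line\nsecond line\n'.encode('utf-8')
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w:gz') as archive:
    info = tarfile.TarInfo(name='result.txt')
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))
buffer.seek(0)

lines = list(iter_text_lines(buffer))
for name, line in lines:
    print(name, line)
```

For a file on disk, `tarfile.open(path, 'r:gz')` takes the path directly in place of the in-memory buffer.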

龙舞九天 posted on 2021-5-11 05:29:49

(emoticon-only reply)
