Cannot extract the content of this web page, Python Discussion, Programming Languages, 鱼C论坛

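python羊 posted on 2020-6-1 11:12:54

Cannot extract the content of this web page

I want to extract the color grade "G" from this site,
but the request fails with an error. What should I do?

Error message: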
Traceback (most recent call last):
  File "C:/Users/查询2.py", line 6, in <module>
    response = urllib.request.urlopen(url).read()
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>>

————————————————————————————————————————————————————————————————
Original code:

import urllib.request
import re
number = '6352100549'
url = 'https://www.gia.edu/CN/report-check?reportno='+ number

response = urllib.request.urlopen(url).read()

response = response.decode('utf-8')

color = re.findall('<strong class="dynamic" id="COLOR">(\w)</strong>',response)

print(color)
——————————————————————————————————————————————————————————————

Twilight6 posted on 2020-6-1 11:12:55

from urllib.request import Request, urlopen
import re

number = '6352100549'
url = 'https://www.gia.edu/CN/report-check?reportno=' + number
# Send a browser-like User-Agent; without one the request is rejected with 403.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'
}

request = Request(url, headers=headers)
response = urlopen(request).read()
response = response.decode('utf-8')
# Print the raw source to see what the server actually returns.
print(response)

Your regex will not match any data in this response.

Twilight6 posted on 2020-6-1 11:15:23

You didn't even set a User-Agent, so the site's anti-scraping check blocked the request outright.
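
To make the point concrete, here is a minimal sketch of sending a browser-like User-Agent and handling a 403 explicitly instead of letting the traceback propagate. The names `DEFAULT_UA`, `build_request` and `fetch_text` are made up for illustration, and the User-Agent string is only an example value; whether a given string gets past the site's check is an assumption that has to be tested against the real server.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Hypothetical default; any browser-like User-Agent string could be used here.
DEFAULT_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"


def build_request(url, user_agent=DEFAULT_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_text(url, user_agent=DEFAULT_UA, timeout=10):
    """Fetch url and decode it as UTF-8; return None if the server answers 403."""
    try:
        with urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except HTTPError as err:
        if err.code == 403:
            # The server still refuses the request, even with the header set.
            print("Still forbidden:", err.reason)
            return None
        raise  # Other HTTP errors are not handled in this sketch.
```

Calling `fetch_text('https://www.gia.edu/CN/report-check?reportno=' + number)` would then return either the page source or `None`, so the calling code can decide what to do on a refusal.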

suchocolate posted on 2020-6-1 12:35:26

color = re.findall(r'<COLOR>(\w)</COLOR>', response)
print(color)
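
A small offline sketch of why the original pattern returns an empty list while this one matches. Both sample strings are hypothetical stand-ins (the real response was not posted in this thread): one mimics what Inspect Element shows, the other mimics a raw response that contains a `<COLOR>` tag.

```python
import re

# Hypothetical samples, not copied from the real site.
rendered_dom = '<strong class="dynamic" id="COLOR">G</strong>'  # seen in Inspect Element
raw_response = '<REPORT><COLOR>G</COLOR></REPORT>'               # returned to urllib

dom_pattern = r'<strong class="dynamic" id="COLOR">(\w)</strong>'
raw_pattern = r'<COLOR>(\w)</COLOR>'

# The pattern written against the rendered page finds nothing in the raw source.
dom_hits = re.findall(dom_pattern, raw_response)
# The pattern written against the raw source finds the grade.
raw_hits = re.findall(raw_pattern, raw_response)

print(dom_hits)  # []
print(raw_hits)  # ['G']
```

Note that `(\w)` captures exactly one character, which fits a single-letter color grade; a longer value would need `(\w+)` or similar.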

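python羊 posted on 2020-6-1 12:53:50

suchocolate posted on 2020-6-1 12:35

Thanks for the reply. The source that comes back really is different from what Inspect Element shows in the browser,
so the regex has to be rewritten.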
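
One way to find out what the regex should look like is to search the raw response for a keyword and print the text around each hit. The sketch below is only an illustration: `show_context` is a made-up helper and `sample` is a hypothetical stand-in for the decoded response.

```python
def show_context(text, keyword, width=30):
    """Return the text surrounding each occurrence of keyword."""
    snippets = []
    start = text.find(keyword)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(text), start + len(keyword) + width)
        snippets.append(text[lo:hi])
        start = text.find(keyword, start + 1)
    return snippets


# Hypothetical raw response; in practice pass the decoded page source instead.
sample = "<REPORT><COLOR>G</COLOR></REPORT>"
for snippet in show_context(sample, "COLOR", width=3):
    print(snippet)
```

Seeing the markup that actually surrounds the keyword makes it easier to write a pattern against the raw source instead of the rendered page.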

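python羊 posted on 2020-6-1 12:54:24

Twilight6 posted on 2020-6-1 11:12
Your regex will not match any data in this response.

Thanks. I'm just getting started and there is a lot I don't understand yet.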
