python羊 posted on 2020-6-1 11:12:54

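Unable to extract content from this web page

I want to extract the color grade "G" from this site,
but the script raises an error. What should I do?

Error output: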
Traceback (most recent call last):
File "C:/Users/查询2.py", line 6, in <module>
    response = urllib.request.urlopen(url).read()
File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
>>>

————————————————————————————————————————————————————————————————
Original code:

import urllib.request
import re
number = '6352100549'
url = 'https://www.gia.edu/CN/report-check?reportno='+ number

response = urllib.request.urlopen(url).read()

response = response.decode('utf-8')

color = re.findall('<strong class="dynamic" id="COLOR">(\w)</strong>',response)

print(color)
——————————————————————————————————————————————————————————————

Twilight6 posted on 2020-6-1 11:12:55

from urllib.request import Request,urlopen
import re
number = '6352100549'
url = 'https://www.gia.edu/CN/report-check?reportno='+ number
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'
}

request = Request(url,headers=headers)
response = urlopen(request).read()
response = response.decode('utf-8')
print(response)
Your regex won't match any data in this response.

Twilight6 posted on 2020-6-1 11:15:23

You didn't even set a User-Agent, so the site's anti-scraping check rejected the request outright.
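
A minimal sketch of the point above, assuming the 403 comes from the server rejecting urllib's default User-Agent (which is of the form Python-urllib/3.x). The helper names build_request and fetch and the DEFAULT_UA string are illustrative, not from the thread, and whether the site still accepts such a request today is not guaranteed.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError

# Assumed browser-like UA string (hypothetical value for illustration).
DEFAULT_UA = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


def build_request(url, user_agent=DEFAULT_UA):
    """Return a Request that carries a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch(url):
    """Return the decoded page text, or None if the server answers with an HTTP error."""
    try:
        with urlopen(build_request(url), timeout=10) as resp:
            return resp.read().decode('utf-8')
    except HTTPError as err:
        # A 403 here means the server refused the request itself,
        # so the regex step is never reached.
        print('HTTP', err.code, err.reason)
        return None
```

Calling fetch() with the URL from the original post would then either return the page text or print the status code instead of ending in a traceback.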

suchocolate posted on 2020-6-1 12:35:26

color = re.findall(r'<COLOR>(\w)</COLOR>', response)
print(color)
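
To see why the original pattern finds nothing, the two patterns can be compared on sample strings. This is only a sketch: the two snippets below are hypothetical stand-ins for the rendered DOM and the raw response, built from the patterns quoted in the thread, and extract_color is an illustrative helper. Note that (\w) captures exactly one character.

```python
import re

# Hypothetical samples: what Inspect Element shows vs. what urlopen returns.
rendered = '<strong class="dynamic" id="COLOR">G</strong>'
raw = '<COLOR>G</COLOR>'

dom_pattern = r'<strong class="dynamic" id="COLOR">(\w)</strong>'
raw_pattern = r'<COLOR>(\w)</COLOR>'


def extract_color(text):
    """Try the raw-response pattern first, then the DOM-style one."""
    for pattern in (raw_pattern, dom_pattern):
        found = re.findall(pattern, text)
        if found:
            return found[0]
    return None


print(re.findall(dom_pattern, raw))   # the DOM-style pattern finds nothing in the raw text
print(extract_color(raw))             # the raw-response pattern does
```

Printing the response first, as in Twilight6's code, and writing the pattern against that text avoids relying on what the browser shows after scripts have run.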

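python羊 posted on 2020-6-1 12:53:50

suchocolate posted on 2020-6-1 12:35


Thanks for the reply. The pattern really is different from the one I wrote based on the browser's Inspect Element view,
so I need to rewrite it.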

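python羊 posted on 2020-6-1 12:54:24

Twilight6 posted on 2020-6-1 11:12
Your regex won't match any data

Thanks. I've only just started learning, so there is a lot I don't understand yet.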