[Solved] Unable to extract content from this web page

python羊 · posted on 2020-6-1 11:12:54

I want to extract the "G" shown under Color Grade on this site,
but the script raises an error. What should I do?

Error output:
Traceback (most recent call last):
  File "C:/Users/查询2.py", line 6, in <module>
    response = urllib.request.urlopen(url).read()
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 531, in open
    response = meth(req, response)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 640, in http_response
    response = self.parent.error(
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 569, in error
    return self._call_chain(*args)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Users\PC-19\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 649, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

————————————————————————————————————————————————————————————————
Original code:

import urllib.request
import re
number = '6352100549'
url = 'https://www.gia.edu/CN/report-check?reportno='+ number

response = urllib.request.urlopen(url).read()

response = response.decode('utf-8')

color = re.findall('<strong class="dynamic" id="COLOR">(\w)</strong>',response)

print(color)
——————————————————————————————————————————————————————————————

Twilight6 · posted on 2020-6-1 11:12:55

This best answer was given by Twilight6; thanks to Twilight6 for the reply.


from urllib.request import Request, urlopen
import re

number = '6352100549'
url = 'https://www.gia.edu/CN/report-check?reportno=' + number

# Send a browser-like User-Agent; without one the site answers 403 Forbidden.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36'
}

# Attach the headers to the request, then fetch and decode the page.
request = Request(url, headers=headers)
response = urlopen(request).read()
response = response.decode('utf-8')

print(response)

Your regex won't match anything in the returned data.
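
As a hedged aside, the same request can be wrapped so that a 403 is reported rather than ending in a traceback. This is only a sketch: the `fetch` helper, its `opener` parameter (there so the function can be exercised without touching the network), and the shortened 'Mozilla/5.0' string are assumptions for illustration and do not come from the thread.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Assumed placeholder; the answer above uses a full Chrome User-Agent string.
USER_AGENT = 'Mozilla/5.0'


def fetch(url, opener=urlopen):
    """Return the decoded body, or None if the server replies with an HTTP error."""
    request = Request(url, headers={'User-Agent': USER_AGENT})
    try:
        with opener(request) as resp:
            return resp.read().decode('utf-8')
    except HTTPError as err:
        # err.code is the status (e.g. 403), err.reason the text (e.g. Forbidden).
        print('request failed:', err.code, err.reason)
        return None
```

Calling `fetch(url)` with the URL from the question would then return either the page text or None, which the caller can check before running any regex.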

Twilight6 · posted on 2020-6-1 11:15:23

You didn't even set a User-Agent, so the site's anti-scraping check blocked the request outright.
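
To make the point above concrete: when no header is supplied, urllib identifies itself as Python-urllib/<version>, which is easy for a site to filter. The sketch below prints that default and builds a request carrying a browser-like value instead. `make_request` and the shortened 'Mozilla/5.0' string are hypothetical, and whether the site accepts such a short string is an untested assumption.

```python
from urllib.request import Request, build_opener

# Headers urllib attaches by default when the caller supplies none,
# a single ('User-agent', 'Python-urllib/<version>') pair.
default_headers = build_opener().addheaders


def make_request(url, user_agent='Mozilla/5.0'):
    """Hypothetical helper: build a Request with a browser-like User-Agent."""
    return Request(url, headers={'User-Agent': user_agent})


req = make_request('https://www.gia.edu/CN/report-check?reportno=6352100549')

print(default_headers)
# Request stores header names capitalized, so the lookup key is 'User-agent'.
print(req.get_header('User-agent'))
```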

suchocolate · posted on 2020-6-1 12:35:26

color = re.findall(r'<COLOR>(\w)</COLOR>', response)
print(color)

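A small offline check of the two patterns may help show the difference. The `sample` string is a hypothetical excerpt, assumed from the pattern above to contain a `<COLOR>` element; the real response body may look different. Both patterns are written as raw strings so that `\w` reaches the regex engine unchanged.

```python
import re

# Hypothetical excerpt of the response; the real body may differ.
sample = '<REPORT><COLOR>G</COLOR></REPORT>'

# Pattern copied from the browser's element inspector (from the question).
inspector_pattern = r'<strong class="dynamic" id="COLOR">(\w)</strong>'
# Pattern matching the raw response (from the reply above).
raw_pattern = r'<COLOR>(\w)</COLOR>'

inspector_matches = re.findall(inspector_pattern, sample)
raw_matches = re.findall(raw_pattern, sample)

print(inspector_matches)  # [] because the inspector markup is not in the raw text
print(raw_matches)        # ['G']
```

The takeaway is to write the pattern against the text that `print(response)` actually shows, not against the markup rendered in the browser.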

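python羊 · posted on 2020-6-1 12:53:50

suchocolate posted on 2020-6-1 12:35

Thanks for the reply. The markup in the response really is different from what the browser's element inspector shows,
so I need to rewrite the regex.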

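python羊 · posted on 2020-6-1 12:54:24

Twilight6 posted on 2020-6-1 11:12
Your regex won't match anything in the returned data.

Thanks. I'm just getting started and there's a lot I don't understand yet.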
