Goal: scrape the movie titles from this page:
https://movie.douban.com/cinema/nowplaying/shanghai/
My code is as follows:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
url = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/', headers=headers)
res = requests.get(url).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('li', attrs={'stitle': 'title'})
cinema_list = []
for d in data:
    plist = d.find('name')['title']
    cinema_list.append(plist)
print(cinema_list)
Here is the error message:
E:\anaconda3\python.exe C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py
Traceback (most recent call last):
File "E:\anaconda3\lib\site-packages\requests\models.py", line 380, in prepare_url
scheme, auth, host, port, path, query, fragment = parse_url(url)
File "E:\anaconda3\lib\site-packages\urllib3\util\url.py", line 392, in parse_url
return six.raise_from(LocationParseError(source_url), None)
File "<string>", line 3, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: <Response [200]>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py", line 1690, in <module>
res = requests.get(url).text
File "E:\anaconda3\lib\site-packages\requests\api.py", line 76, in get
return request('get', url, params=params, **kwargs)
File "E:\anaconda3\lib\site-packages\requests\api.py", line 61, in request
return session.request(method=method, url=url, **kwargs)
File "E:\anaconda3\lib\site-packages\requests\sessions.py", line 516, in request
prep = self.prepare_request(req)
File "E:\anaconda3\lib\site-packages\requests\sessions.py", line 449, in prepare_request
p.prepare(
File "E:\anaconda3\lib\site-packages\requests\models.py", line 314, in prepare
self.prepare_url(url, params)
File "E:\anaconda3\lib\site-packages\requests\models.py", line 382, in prepare_url
raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: <Response [200]>
Process finished with exit code 1
I'm a complete beginner, so I'd really appreciate it if someone could explain what's going wrong. Many thanks!
This post was last edited by isdkz on 2023-4-19 12:27.
I made the following changes to the original code:

1. Changed how the page content is fetched: in the original code, you first call requests.get() to obtain a response object, and then call requests.get() again with that response object to get its text. That is not correct, because you already have the response object. In the modified code, the URL and the request headers are passed to a single requests.get() call, and the .text attribute gives the response body (a short check illustrating this follows below).

Original code: url = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/',headers=headers)
res = requests.get(url).text
Modified code: url = 'https://movie.douban.com/cinema/nowplaying/shanghai/'
res = requests.get(url, headers=headers).text
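To see why the two-step call fails, here is a minimal, self-contained check (my own illustration, not part of the original post): requests.get() returns a Response object, and feeding that object back into requests.get() makes requests try to parse the string "<Response [200]>" as a URL, which is exactly the InvalidURL error in the traceback.

import requests

# requests.get() returns a Response object, not a URL string
resp = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/',
                    headers={'user-agent': 'Mozilla/5.0'})
print(type(resp))        # <class 'requests.models.Response'>
print(resp.status_code)  # e.g. 200 if the request succeeded
print(resp.text[:100])   # the HTML body lives in the .text attribute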
2. Changed how the movie entries are located: in the original code, you look for li elements that have a stitle attribute, but these li elements actually carry stitle as their class name. So the search condition is changed from attrs={'stitle': 'title'} to attrs={'class': 'stitle'} (a tiny demo follows below).

Original code: data = content.find_all('li', attrs={'stitle': 'title'})
Modified code: data = content.find_all('li', attrs={'class': 'stitle'})
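A tiny self-contained demo of the lookup (the HTML snippet is my own simplified example, not the real Douban markup): attrs={'class': 'stitle'} and the class_ keyword match the same elements.

from bs4 import BeautifulSoup

# Made-up HTML that mimics a <li class="stitle"> entry
html = '<ul><li class="stitle"><a href="#" title="Some Movie">Some Movie</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('li', attrs={'class': 'stitle'}))  # matches by class name
print(soup.find_all('li', class_='stitle'))            # equivalent keyword form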
3. Changed how the movie title is extracted from each li element: in the original code, you try to read the title attribute of a name element, but the movie title actually lives in the title attribute of the a element. So the extraction is changed from d.find('name')['title'] to d.find('a')['title'] (a small sketch follows below).

Original code: plist = d.find('name')['title']
Modified code: plist = d.find('a')['title']
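Likewise, a small sketch (again with made-up HTML, not the real page) showing where the title attribute lives and a slightly safer way to read it:

from bs4 import BeautifulSoup

# Made-up <li> mimicking the structure described above
html = '<li class="stitle"><a href="#" title="Some Movie">Some Movie</a></li>'
li = BeautifulSoup(html, 'html.parser').find('li')
print(li.find('a')['title'])      # 'Some Movie' -- raises KeyError if the attribute is missing
print(li.find('a').get('title'))  # same value, but returns None instead of raising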
With the changes above, the code should run correctly and print the list of movies currently showing in Shanghai.

Full modified code:

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

# Fetch the page once, passing the URL string and the headers together
url = 'https://movie.douban.com/cinema/nowplaying/shanghai/'
res = requests.get(url, headers=headers).text

# Parse the HTML and collect every <li class="stitle"> entry
content = BeautifulSoup(res, "html.parser")
data = content.find_all('li', attrs={'class': 'stitle'})

cinema_list = []
for d in data:
    # The movie name sits in the title attribute of the <a> tag
    plist = d.find('a')['title']
    cinema_list.append(plist)
print(cinema_list)
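If you want to harden the script a little, here is an optional variant (my own sketch, assuming the page keeps the same structure, not part of the original answer): it checks the HTTP status and skips any entry whose <a> tag has no title attribute.

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
url = 'https://movie.douban.com/cinema/nowplaying/shanghai/'

resp = requests.get(url, headers=headers, timeout=10)
resp.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page

soup = BeautifulSoup(resp.text, 'html.parser')
cinema_list = []
for li in soup.find_all('li', class_='stitle'):
    a = li.find('a')
    if a is not None and a.get('title'):  # skip malformed entries instead of crashing
        cinema_list.append(a['title'])
print(cinema_list)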