FishC Forum (鱼C论坛)


[Solved] Beginner learning web scraping, running into a pile of errors. Could someone help me out? Many thanks!

Posted on 2023-4-19 12:00:26

Goal: scrape the movie titles from this page:
https://movie.douban.com/cinema/nowplaying/shanghai/

My code is as follows:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
url = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/',headers=headers)
res = requests.get(url).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('li', attrs={'stitle': 'title'})
cinema_list = []
for d in data:
    plist = d.find('name')['title']
    cinema_list.append(plist)
print(cinema_list)


Here is the error output:
E:\anaconda3\python.exe C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py
Traceback (most recent call last):
  File "E:\anaconda3\lib\site-packages\requests\models.py", line 380, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "E:\anaconda3\lib\site-packages\urllib3\util\url.py", line 392, in parse_url
    return six.raise_from(LocationParseError(source_url), None)
  File "<string>", line 3, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: <Response [200]>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py", line 1690, in <module>
    res = requests.get(url).text
  File "E:\anaconda3\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "E:\anaconda3\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\anaconda3\lib\site-packages\requests\sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "E:\anaconda3\lib\site-packages\requests\sessions.py", line 449, in prepare_request
    p.prepare(
  File "E:\anaconda3\lib\site-packages\requests\models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "E:\anaconda3\lib\site-packages\requests\models.py", line 382, in prepare_url
    raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: <Response [200]>

Process finished with exit code 1


I'm a complete beginner, so any help would be much appreciated. Thanks!
Best answer
2023-4-19 12:06:37
Last edited by isdkz on 2023-4-19 12:27

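I made the following changes to the original code:

1. Changed how the page content is fetched. The original code first calls requests.get() to obtain a response object, then passes that response object to a second requests.get() call as if it were a URL.

That is the cause of the error: requests.get() expects a URL string, and you already have the response object. In the revised code, I pass the URL and the request headers to a single requests.get() call and read the response body from its .text attribute.

Original code: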
url = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/',headers=headers)
res = requests.get(url).text

Revised code:
url = 'https://movie.douban.com/cinema/nowplaying/shanghai/'
res = requests.get(url, headers=headers).text

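2. Changed how the movie entries are located. The original code looks for li elements that have an attribute named stitle. In fact, stitle is the class name of these li elements.

So I changed the search condition from attrs={'stitle': 'title'} to attrs={'class': 'stitle'}.

Original code: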
data = content.find_all('li', attrs={'stitle': 'title'})

Revised code:
data = content.find_all('li', attrs={'class': 'stitle'})

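3. Changed how the movie title is extracted from each li element. The original code tries to read the title attribute of a name element.

In fact, the movie title is stored in the title attribute of the a element, so I changed d.find('name')['title'] to d.find('a')['title'].

Original code: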
plist = d.find('name')['title']

Revised code:
plist = d.find('a')['title']

With these changes, the code should run correctly and print the list of movies currently showing in Shanghai.
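The failure described in point 1 can be reproduced without touching the network. The sketch below is an illustration and not part of the original answer: it builds an empty requests.Response by hand and hands it to requests where a URL string is expected. The helper name describe_failure is made up for this example, and the exact exception class may vary between requests versions (the traceback in this thread shows InvalidURL).

```python
import requests


def describe_failure(url_argument):
    """Hypothetical helper: prepare a GET request and report why it fails, if it does."""
    try:
        # Preparing a request validates the URL; nothing is sent over the network.
        requests.Request("GET", url_argument).prepare()
    except requests.exceptions.RequestException as exc:
        return type(exc).__name__
    return None


# A hand-built stand-in for the object returned by the first requests.get() call.
fake_response = requests.Response()
fake_response.status_code = 200

# requests converts whatever it is given to a string before parsing it as a URL,
# which is where the text "<Response [200]>" in the error message comes from.
print(repr(fake_response))

# Passing the response object where a URL belongs fails during preparation.
print(describe_failure(fake_response))

# Passing the URL string itself prepares cleanly, so the helper returns None.
print(describe_failure("https://movie.douban.com/cinema/nowplaying/shanghai/"))
```

The point of the sketch is only that the variable passed to requests.get() must hold the address as a string, not the result of an earlier request.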


Complete revised code:
import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
url = 'https://movie.douban.com/cinema/nowplaying/shanghai/'
res = requests.get(url, headers=headers).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('li', attrs={'class': 'stitle'})
cinema_list = []
for d in data:
    plist = d.find('a')['title']
    cinema_list.append(plist)
print(cinema_list)
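
Points 2 and 3 can also be checked offline against a small HTML fragment. The fragment below is invented for illustration and only mimics the structure the answer describes (an li with class stitle wrapping an a element whose title attribute holds the movie name); the live Douban markup may differ or change over time. extract_titles is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Invented sample markup; it is an assumption about the page structure, not a copy of it.
SAMPLE_HTML = """
<ul>
  <li class="stitle"><a href="#" title="Movie A">Movie A</a></li>
  <li class="stitle"><a href="#" title="Movie B">Movie B</a></li>
  <li class="srating">8.1</li>
</ul>
"""


def extract_titles(html):
    """Collect the title attribute of the first <a> inside each <li class="stitle">."""
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for li in soup.find_all("li", class_="stitle"):
        link = li.find("a")
        # Skip entries without a link or without a title attribute instead of raising.
        if link is not None and link.has_attr("title"):
            titles.append(link["title"])
    return titles


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The original selector asks for an attribute named "stitle", which no element has.
original_matches = soup.find_all("li", attrs={"stitle": "title"})
print(len(original_matches))

# The revised selector matches on the class name instead.
print(extract_titles(SAMPLE_HTML))
```

The None check is a defensive addition: d.find('a')['title'] in the answer raises an error if an entry has no a element or no title attribute.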
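OP | Posted on 2023-4-19 13:38:39
isdkz wrote on 2023-4-19 12:06:
I made the following changes to the original code:

1. Changed how the page content is fetched. The original code first calls requests.get() ...

I changed the code, but I still get an error. Why is that?
Revised code: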
import requests
from bs4 import BeautifulSoup


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
url = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/',headers=headers)
res = requests.get(url, headers=headers).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('li', attrs={'class': 'stitle'})
cinema_list = []
for d in data:
    plist = d.find('a')['title']
    cinema_list.append(plist)
print(cinema_list)


The error output is:
E:\anaconda3\python.exe C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py
Traceback (most recent call last):
  File "E:\anaconda3\lib\site-packages\requests\models.py", line 380, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "E:\anaconda3\lib\site-packages\urllib3\util\url.py", line 392, in parse_url
    return six.raise_from(LocationParseError(source_url), None)
  File "<string>", line 3, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: <Response [200]>

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py", line 1690, in <module>
    res = requests.get(url, headers=headers).text
  File "E:\anaconda3\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "E:\anaconda3\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "E:\anaconda3\lib\site-packages\requests\sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "E:\anaconda3\lib\site-packages\requests\sessions.py", line 449, in prepare_request
    p.prepare(
  File "E:\anaconda3\lib\site-packages\requests\models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "E:\anaconda3\lib\site-packages\requests\models.py", line 382, in prepare_url
    raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: <Response [200]>

Process finished with exit code 1

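Posted on 2023-4-19 13:51:34
liubulong wrote on 2023-4-19 13:38:
I changed the code, but I still get an error. Why is that?
Revised code:
import requests


You still haven't fixed it. I changed three places in total, and you only changed two.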

import requests
from bs4 import BeautifulSoup


headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
url = requests.get('https://movie.douban.com/cinema/nowplaying/shanghai/',headers=headers)    # this line needs to change too: assign the URL string, not the response
res = requests.get(url, headers=headers).text
content = BeautifulSoup(res, "html.parser")
data = content.find_all('li', attrs={'class': 'stitle'})
cinema_list = []
for d in data:
    plist = d.find('a')['title']
    cinema_list.append(plist)
print(cinema_list)
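OP | Posted on 2023-4-19 14:29:18
isdkz wrote on 2023-4-19 13:51:
You still haven't fixed it. I changed three places in total, and you only changed two.

import requests

Right, that was careless of me, sorry. Thank you for pointing it out, I really appreciate it!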
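OP | Posted on 2023-4-19 14:40:26
liubulong wrote on 2023-4-19 14:29:
Right, that was careless of me, sorry. Thank you for pointing it out, I really appreciate it!

I now want to save the results to an Excel spreadsheet, but I get an error. Could you help once more? Thanks.

The code is as follows: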
import requests
from bs4 import BeautifulSoup
import openpyxl

def open_url(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
    html = requests.get(url, headers=headers).text
    return html

def get_data(html):
    content = BeautifulSoup(html, "html.parser")
    data = content.find_all('li', attrs={'class': 'stitle'})
    cinema_list = []
    for d in data:
        plist = d.find('a')['title']
        cinema_list.append(plist)
        return cinema_list

def save_as_excel(mylist):
    wb = openpyxl.Workbook()
    ws = wb.active

    ws['A1'] = '电影'

    for d in mylist:
        ws.append(d)

    wb.save('2023电影排行.xlsx')

def main():
    url = 'https://movie.douban.com/cinema/nowplaying/shanghai/'
    html = open_url(url)
    mylist = get_data(html)
    save_as_excel(mylist)

if __name__ == '__main__':
    main()

Error output:
E:\anaconda3\python.exe C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py
Traceback (most recent call last):
  File "C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py", line 1719, in <module>
    main()
  File "C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py", line 1716, in main
    save_as_excel(mylist)
  File "C:/Users/lyl/PycharmProjects/pythonProject1/dangdang/dangdang/spiders/abc.py", line 1708, in save_as_excel
    ws.append(d)
  File "E:\anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 675, in append
    self._invalid_row(iterable)
  File "E:\anaconda3\lib\site-packages\openpyxl\worksheet\worksheet.py", line 811, in _invalid_row
    raise TypeError('Value must be a list, tuple, range or generator, or a dict. Supplied value is {0}'.format(
TypeError: Value must be a list, tuple, range or generator, or a dict. Supplied value is <class 'str'>

Process finished with exit code 1
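
A note on the traceback above, offered as a sketch and not as a fix confirmed in the thread: the TypeError says that ws.append() wants one whole row (a list, tuple, or similar), while the loop passes a bare string. Wrapping each title in a one-element list is one way to satisfy that. Separately, the return cinema_list in the posted get_data is indented inside the for loop, so it would return after the first title; dedenting it by one level returns the full list. build_workbook and the header text "Movie" are names chosen for this example.

```python
import openpyxl


def build_workbook(titles):
    """Hypothetical helper: put a header and one title per row into column A."""
    wb = openpyxl.Workbook()
    ws = wb.active
    # append() takes one row at a time, so every row is a list, even with one cell.
    ws.append(["Movie"])
    for title in titles:
        ws.append([title])
    return wb


wb = build_workbook(["Movie A", "Movie B"])

# Read column A back to confirm each title landed in its own row.
print([cell.value for cell in wb.active["A"]])
```

Calling wb.save() with a file name, as the posted save_as_excel does, then writes the workbook to disk.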
