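Help! A problem with scraping images from Baidu Tieba!
I took 小甲鱼's introductory crawler course and followed along to download images from Baidu Tieba. My source code is basically the same as his, but it ends with this error: urllib.error.URLError: <urlopen error unknown url type: "http>
People online say some ssl module is missing, but I have looked into it for a long time and really can't figure out what to do.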
python -m pip install ssl -i https://pypi.tuna.tsinghua.edu.cn/simple
Install the ssl module with the command above, then import it on the first line of your file (import ssl) and try again.

In reply to Twilight6 (posted 2020-6-6 16:46):
[attached screenshot: 999de2d7424fd2482b827448d789999.png]
I can't install it. It fails with: ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

The error screenshot is incomplete. Please post both the code and the full error message.

Last edited by Twilight6 on 2020-6-6 17:49
In reply to Ericwooooo0622 (posted 2020-6-6 16:55):
Upgrade these two, then repeat the step from post #2:
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools

In reply to suchocolate (posted 2020-6-6 17:38):
PyDev console: starting.
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) on win32
runfile('F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py', wdir='F:/python/课后习题/第53讲 爬虫')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "E:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars)# execute the script
File "E:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py", line 21, in <module>
get_img(open_url(url))
File "F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py", line 17, in get_img
urllib.request.urlretrieve(each, filename, None)
File "C:\Program Files\Python38\lib\urllib\request.py", line 247, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Program Files\Python38\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Program Files\Python38\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Program Files\Python38\lib\urllib\request.py", line 547, in _open
return self._call_chain(self.handle_open, 'unknown',
File "C:\Program Files\Python38\lib\urllib\request.py", line 502, in _call_chain
result = func(*args)
File "C:\Program Files\Python38\lib\urllib\request.py", line 1390, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: "http>
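The last line of this traceback can be reproduced without any network access, which helps narrow down where to look. The sketch below is an illustration under assumptions: the example.com URL is a placeholder that does not come from the thread, and try_open is a hypothetical helper. It feeds urllib a URL string that begins with a stray double quote. urllib takes everything before the first colon as the scheme, finds no handler for it, and raises URLError before any connection is attempted.

```python
import urllib.error
import urllib.request

# Placeholder URL with a stray leading double quote (not from the thread).
bad_url = '"http://example.com/picture.jpg'


def try_open(url):
    """Return the URLError reason raised for url, or None if it opened."""
    try:
        with urllib.request.urlopen(url, timeout=5):
            return None
    except urllib.error.URLError as err:
        return str(err.reason)


# Request.type holds whatever urllib parsed as the scheme.
req = urllib.request.Request(bad_url)
print(repr(req.type))

# No handler is registered for that scheme, so this fails immediately.
reason = try_open(bad_url)
print(reason)
```

If the printed scheme is anything other than http or https, the string handed to urlretrieve is the thing to inspect, not the Python installation.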
That is the complete error output.

In reply to Twilight6 (posted 2020-6-6 17:47):
I just tried that, but it still won't install T T

In reply to Ericwooooo0622 (posted 2020-6-6 18:05):
Post your code!!!

In reply to suchocolate (posted 2020-6-6 20:14):
Oh, sorry, hahaha
import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    page = urllib.request.urlopen(req)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.44')
    html = page.read().decode('utf-8')
    return html


def get_img(html):
    p = re.compile(r'<img class="BDE_Image" src=("[^"]*\.jpg)"')
    img_list = re.findall(p,html)
    for each in img_list:
        filename = each.split('/')[-1]
        urllib.request.urlretrieve(each, filename, None)

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/p/6591896494'
    get_img(open_url(url))
That's all of the code.

Last edited by suchocolate on 2020-6-7 10:31
In reply to Ericwooooo0622 (posted 2020-6-7 09:46):
1. The error points to line 17 and reports an unknown URL type: urllib.error.URLError: <urlopen error unknown url type: "http>
2. That is this part of the traceback:
File "F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py", line 17, in get_img
urllib.request.urlretrieve(each, filename, None)
3. The URL in each is probably wrong, so print it to check:
for each in img_list:
    filename = each.split('/')[-1]
    print(each)
    exit(0)
    urllib.request.urlretrieve(each, filename, None)
4. Output after running:
"http://tiebapic.baidu.com/forum/w%3D580/sign=d01e5ef2ce33c895a67e9873e1127397/4a408c01a18b87d6f845827a100828381f30fd29.jpg
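5. The problem is the double quote at the start. The capture group in the regex includes the opening quote, so urllib reads "http as the URL scheme. Move the quote outside the parentheses: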
p = re.compile(r'<img class="BDE_Image" src="(.*?\.jpg)"')
6. Printing again now gives a clean URL:
http://tiebapic.baidu.com/forum/ ... 00828381f30fd29.jpg

In reply to suchocolate (posted 2020-6-7 10:25):
Thank you so much! Problem solved!
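For anyone who lands here with the same error, the difference between the two patterns can be checked on a small sample without hitting Tieba. This is a sketch under assumptions: the HTML snippet and its example.com URLs are invented for illustration, and extract_image_urls is a hypothetical helper, not part of the course code.

```python
import re

# Invented sample markup; real Tieba pages may carry more attributes.
SAMPLE_HTML = (
    '<div><img class="BDE_Image" src="http://example.com/a/one.jpg" width="560">'
    '<img class="BDE_Image" src="http://example.com/b/two.jpg"></div>'
)

# Pattern from the original post: the opening quote is inside the group.
OLD_PATTERN = re.compile(r'<img class="BDE_Image" src=("[^"]*\.jpg)"')
# Pattern from step 5: both quotes are outside the group.
NEW_PATTERN = re.compile(r'<img class="BDE_Image" src="(.*?\.jpg)"')


def extract_image_urls(html, pattern=NEW_PATTERN):
    """Return every captured src value that the pattern finds in html."""
    return pattern.findall(html)


old_urls = extract_image_urls(SAMPLE_HTML, OLD_PATTERN)
new_urls = extract_image_urls(SAMPLE_HTML, NEW_PATTERN)

print(old_urls)  # each entry still carries the leading double quote
print(new_urls)  # each entry starts directly with http
```

A slightly stricter variant would keep [^"]* inside the group, so that a match can never run past the closing quote of the src attribute. That is a suggestion rather than what the thread settled on.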
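Two side points from the thread that the regex fix does not touch. First, ssl ships with Python 3 as part of the standard library, so there is nothing to install with pip. As far as I can tell, the package named ssl on PyPI is an old backport for Python 2, which would explain the setup.py egg_info failure. Second, in the posted open_url the add_header call runs after urlopen, so the User-Agent is never sent with that request. The sketch below shows one way to check the first point and reorder the second. build_request and https_supported are hypothetical helper names, and the User-Agent string is a shortened placeholder.

```python
import ssl
import urllib.request

# Shortened placeholder; any browser-like User-Agent string would do here.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


def build_request(url):
    """Create the Request and attach the header before it is opened."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', USER_AGENT)
    return req


def https_supported():
    """True when this Python build has an HTTPS handler in urllib."""
    return hasattr(urllib.request, 'HTTPSHandler')


req = build_request('https://tieba.baidu.com/p/6591896494')
print(req.get_header('User-agent'))  # urllib stores header names capitalized
print(ssl.OPENSSL_VERSION)
print(https_supported())
```

Passing the returned object to urllib.request.urlopen(req) would then send the header along with the request. If import ssl itself fails, the Python build is missing ssl support, and reinstalling Python from python.org is likely a more direct fix than pip.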