[Solved] Help! A problem with scraping images from Baidu Tieba

Ericwooooo0622 · posted 2020-6-6 16:36:23

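I took 小甲鱼's introductory web-scraping course and followed along to download images from Baidu Tieba. My source code is basically the same as his, but it ends with this error:

urllib.error.URLError: <urlopen error unknown url type: "http>

People online say it's a missing ssl module, but I've been at it for a long time and really can't figure out what to do.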

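Twilight6 · posted 2020-6-6 16:46:33

"People online say it's a missing ssl module, but I've been at it for a long time and really can't figure out what to do"
python -m pip install ssl -i https://pypi.tuna.tsinghua.edu.cn/simple

Install the ssl module, then import it on the first line of your file:
import ssl

Then try again.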

Ericwooooo0622 · posted 2020-6-6 16:55:53

Twilight6 wrote on 2020-6-6 16:46
Install the ssl module, then import it on the first line of your file, then try again

C:\Users\86738\Desktop\999de2d7424fd2482b827448d789999.png

It won't install. I get this error: ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
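
A side note on the failed install, as a minimal sketch assuming a standard CPython 3 build: the ssl module is normally bundled with the interpreter rather than installed through pip, so importing it directly is usually enough to confirm it is present. The `describe_ssl` helper is a hypothetical name for illustration; it only reports what the local interpreter has and does not touch the network.

```python
import ssl
import sys


def describe_ssl():
    """Return a short description of the ssl support in this interpreter."""
    # ssl.OPENSSL_VERSION is the version string of the library that the
    # standard-library ssl module was built against.
    version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return f"Python {version}: {ssl.OPENSSL_VERSION}"


if __name__ == "__main__":
    print(describe_ssl())
```

If this prints a version string, the module is already available, which suggests looking elsewhere for the cause of the `unknown url type` error.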

suchocolate · posted 2020-6-6 17:38:45

The error screenshot is incomplete. Please post both the code and the full error message.

Twilight6 · posted 2020-6-6 17:47:56

Last edited by Twilight6 on 2020-6-6 17:49

Ericwooooo0622 wrote on 2020-6-6 16:55
It won't install. I get this error: ERROR: Command errored out with exit status 1: python setup.py egg_info Check ...

Upgrade these two, then repeat the steps from post #2 (2L):

python -m pip install --upgrade pip

python -m pip install --upgrade setuptools

Ericwooooo0622 · posted 2020-6-6 18:05:08

suchocolate wrote on 2020-6-6 17:38
The error screenshot is incomplete. Please post both the code and the full error message.

PyDev console: starting.
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 bit (AMD64)] on win32
runfile('F:/python/课后习题/第53讲爬虫/2020.6.6 从贴吧上爬取图片.py', wdir='F:/python/课后习题/第53讲爬虫')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "E:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "E:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "F:/python/课后习题/第53讲爬虫/2020.6.6 从贴吧上爬取图片.py", line 21, in <module>
    get_img(open_url(url))
  File "F:/python/课后习题/第53讲爬虫/2020.6.6 从贴吧上爬取图片.py", line 17, in get_img
    urllib.request.urlretrieve(each, filename, None)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
  File "C:\Program Files\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 547, in _open
    return self._call_chain(self.handle_open, 'unknown',
  File "C:\Program Files\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
  File "C:\Program Files\Python38\lib\urllib\request.py", line 1390, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: "http>
That's the full error output.
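
To see where the odd `"http` at the end of the traceback comes from, here is a small offline sketch. It assumes that `urllib.request.Request` exposes the parsed scheme as its `type` attribute, which is the value the `unknown url type: %s` message is formatted from. The `url_scheme` helper and the `example.com` URLs are placeholders for illustration, not the real image links.

```python
import urllib.request


def url_scheme(url):
    """Return the scheme urllib parses from url, without opening a connection."""
    # Building a Request only parses the string; nothing is sent.
    return urllib.request.Request(url).type


if __name__ == "__main__":
    # A stray leading double quote becomes part of the parsed scheme.
    print(repr(url_scheme('"http://example.com/a.jpg')))
    # A clean URL yields a scheme that urllib has a handler for.
    print(repr(url_scheme('http://example.com/a.jpg')))
```

Everything before the first colon is treated as the scheme, so a URL string that starts with a quote character ends up with a scheme no handler recognizes.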

Ericwooooo0622 · posted 2020-6-6 18:05:42

Twilight6 wrote on 2020-6-6 17:47
Upgrade these two, then repeat the steps from post #2 (2L)

I just tried that, but it still won't install T T

suchocolate · posted 2020-6-6 20:14:18

Ericwooooo0622 wrote on 2020-6-6 18:05
PyDev console: starting.
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) [MSC v.1916 64 ...

Post the code!!!

Ericwooooo0622 · posted 2020-6-7 09:46:31

suchocolate wrote on 2020-6-6 20:14
Post the code!!!

Ah, sorry, haha
import urllib.request
import re

def open_url(url):
    req = urllib.request.Request(url)
    page = urllib.request.urlopen(req)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.44')
    html = page.read().decode('utf-8')
    return html

def get_img(html):
    p = re.compile(r'<img class="BDE_Image" src=("[^"]*\.jpg)"')
    img_list = re.findall(p,html)
    for each in img_list:
        filename = each.split('/')[-1]
        urllib.request.urlretrieve(each, filename, None)

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/p/6591896494'
    get_img(open_url(url))

That's all of the code.
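
Separately from the error being asked about, one detail of `open_url` as posted may be worth noting: `add_header` runs after `urlopen`, so the User-Agent is attached too late to be sent with that request. A minimal sketch of the reordered setup follows. `build_request` and the shortened `USER_AGENT` value are hypothetical names introduced for illustration, and nothing is fetched here.

```python
import urllib.request

# Hypothetical shortened value; the post uses a full browser User-Agent string.
USER_AGENT = "Mozilla/5.0"


def build_request(url):
    """Create a Request whose User-Agent is set before it is ever opened."""
    req = urllib.request.Request(url)
    req.add_header("User-Agent", USER_AGENT)
    return req


if __name__ == "__main__":
    req = build_request("https://tieba.baidu.com/p/6591896494")
    # Request stores header names in capitalized form, e.g. "User-agent".
    print(req.get_header("User-agent"))
```

The prepared request would then be passed to `urllib.request.urlopen(req)` as in the original function.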

suchocolate · posted 2020-6-7 10:25:09

Best answer, given by suchocolate. Thanks to suchocolate for the answer.

Last edited by suchocolate on 2020-6-7 10:31

Ericwooooo0622 wrote on 2020-6-7 09:46
Ah, sorry, haha
import urllib.request
import re

1. The error says line 17 hit an unknown URL type: urllib.error.URLError: <urlopen error unknown url type: "http>
2. That is the line shown in the traceback:
  File "F:/python/课后习题/第53讲爬虫/2020.6.6 从贴吧上爬取图片.py", line 17, in get_img
    urllib.request.urlretrieve(each, filename, None)
3. The URL in each is probably wrong, so print it to check:
for each in img_list:
    filename = each.split('/')[-1]
    print(each)
    exit(0)
    urllib.request.urlretrieve(each, filename, None)
4. Output after running:
"http://tiebapic.baidu.com/forum/w%3D580/sign=d01e5ef2ce33c895a67e9873e1127397/4a408c01a18b87d6f845827a100828381f30fd29.jpg
5. The problem is the double quote at the start: the capture group in the regex includes the opening double quote. Move that double quote outside the parentheses:
p = re.compile(r'<img class="BDE_Image" src="(.*?\.jpg)"')
6. Printing again now gives a clean URL:
http://tiebapic.baidu.com/forum/ ... 00828381f30fd29.jpg
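
The difference between the two patterns can be reproduced offline. The sketch below runs both against a made-up `<img>` tag and shows which one leaves the quote inside the captured group. The `SAMPLE_HTML` string, its `example.com` URL, and the `extract` helper are assumptions for illustration, not markup or code taken from the Tieba page.

```python
import re

# Hypothetical markup shaped like the tag the patterns target.
SAMPLE_HTML = (
    '<img class="BDE_Image" '
    'src="http://example.com/forum/abc123.jpg" width="560">'
)

# Pattern from the question: the opening quote sits inside the capture group.
BUGGY = re.compile(r'<img class="BDE_Image" src=("[^"]*\.jpg)"')
# Pattern from the answer: the quote is matched but not captured.
FIXED = re.compile(r'<img class="BDE_Image" src="(.*?\.jpg)"')


def extract(pattern, html):
    """Return every string captured by pattern in html."""
    return pattern.findall(html)


if __name__ == "__main__":
    for name, pattern in (("buggy", BUGGY), ("fixed", FIXED)):
        for each in extract(pattern, SAMPLE_HTML):
            # Same filename logic as the original loop.
            filename = each.split('/')[-1]
            print(name, repr(each), filename)
```

Printing with `repr` makes a stray leading quote easy to spot, which plain `print(each)` can hide at a glance.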

Ericwooooo0622 · posted 2020-6-7 19:23:48

suchocolate wrote on 2020-6-7 10:25
1. The error says line 17 hit an unknown URL type: urllib.error.URLError:

Thank you so much! Problem solved!
