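Ericwooooo0622 posted on 2020-6-6 16:36:23

Help needed! A problem with scraping images from Tieba!

I took 小甲鱼's introductory web crawler course and followed along to download images from Baidu Tieba. My source code is basically the same as his, but it ends with this error:

urllib.error.URLError: <urlopen error unknown url type: "http>

People online say some ssl module is missing, but I have looked into it for a long time and really can't figure out what to do.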

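Twilight6 posted on 2020-6-6 16:46:33

People online say some ssl module is missing, but I have looked into it for a long time and really can't figure out what to do.
python -m pip install ssl -i https://pypi.tuna.tsinghua.edu.cn/simple
Install the ssl module, then import it on the first line of your file (import ssl) and try again.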

Ericwooooo0622 posted on 2020-6-6 16:55:53

Twilight6 posted on 2020-6-6 16:46
Install the ssl module, then import it on the first line of your file and try again.

C:\Users\86738\Desktop\999de2d7424fd2482b827448d789999.png

It won't install. It reports: ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

suchocolate posted on 2020-6-6 17:38:45

The error screenshot is incomplete. Please post both the code and the full error.

Twilight6 posted on 2020-6-6 17:47:56

Last edited by Twilight6 on 2020-6-6 17:49

Ericwooooo0622 posted on 2020-6-6 16:55
It won't install. It reports: ERROR: Command errored out with exit status 1: python setup.py egg_info Check...

Upgrade these two, then repeat the steps from post #2:
python -m pip install --upgrade pip
python -m pip install --upgrade setuptools

Ericwooooo0622 posted on 2020-6-6 18:05:08

suchocolate posted on 2020-6-6 17:38
The error screenshot is incomplete. Please post both the code and the full error.

PyDev console: starting.
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46) on win32
runfile('F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py', wdir='F:/python/课后习题/第53讲 爬虫')
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "E:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)# execute the script
File "E:\Program Files\JetBrains\PyCharm Community Edition 2019.3.3\plugins\python-ce\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py", line 21, in <module>
    get_img(open_url(url))
File "F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py", line 17, in get_img
    urllib.request.urlretrieve(each, filename, None)
File "C:\Program Files\Python38\lib\urllib\request.py", line 247, in urlretrieve
    with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Program Files\Python38\lib\urllib\request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
File "C:\Program Files\Python38\lib\urllib\request.py", line 525, in open
    response = self._open(req, data)
File "C:\Program Files\Python38\lib\urllib\request.py", line 547, in _open
    return self._call_chain(self.handle_open, 'unknown',
File "C:\Program Files\Python38\lib\urllib\request.py", line 502, in _call_chain
    result = func(*args)
File "C:\Program Files\Python38\lib\urllib\request.py", line 1390, in unknown_open
    raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: "http>
This is the complete error output.

Ericwooooo0622 posted on 2020-6-6 18:05:42

Twilight6 posted on 2020-6-6 17:47
Upgrade these two, then repeat the steps from post #2

I just tried this, but it still won't install T T

suchocolate posted on 2020-6-6 20:14:18

Ericwooooo0622 posted on 2020-6-6 18:05
PyDev console: starting.
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 23:11:46)

Post your code!!!

Ericwooooo0622 posted on 2020-6-7 09:46:31

suchocolate posted on 2020-6-6 20:14
Post your code!!!

Ah, sorry, hahaha
import urllib.request
import re


def open_url(url):
    req = urllib.request.Request(url)
    page = urllib.request.urlopen(req)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36 Edg/83.0.478.44')
    html = page.read().decode('utf-8')
    return html

def get_img(html):
    p = re.compile(r'<img class="BDE_Image" src=("[^"]*\.jpg)"')
    img_list = re.findall(p,html)
    for each in img_list:
      filename = each.split('/')[-1]
      urllib.request.urlretrieve(each, filename, None)

if __name__ == '__main__':
    url = 'https://tieba.baidu.com/p/6591896494'
    get_img(open_url(url))

This is the complete code.

suchocolate posted on 2020-6-7 10:25:09

Last edited by suchocolate on 2020-6-7 10:31

Ericwooooo0622 posted on 2020-6-7 09:46
Ah, sorry, hahaha
import urllib.request
import re

1. The error says that line 17 hit an unknown URL type: urllib.error.URLError: <urlopen error unknown url type: "http>
2. That is this part of the traceback:
File "F:/python/课后习题/第53讲 爬虫/2020.6.6 从贴吧上爬取图片.py", line 17, in get_img
    urllib.request.urlretrieve(each, filename, None)
3. The URL in each is probably wrong, so print it to check:
    for each in img_list:
      filename = each.split('/')[-1]
      print(each)
      exit(0)
      urllib.request.urlretrieve(each, filename, None)
4. Output after running:
"http://tiebapic.baidu.com/forum/w%3D580/sign=d01e5ef2ce33c895a67e9873e1127397/4a408c01a18b87d6f845827a100828381f30fd29.jpg
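5. The problem is the double quote at the start: the regex capture group also matches the opening quote. Move that quote outside the parentheses: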
    p = re.compile(r'<img class="BDE_Image" src="(.*?\.jpg)"')
6. Printing again now gives the correct URL:
http://tiebapic.baidu.com/forum/ ... 00828381f30fd29.jpg
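
For reference, a minimal sketch of steps 3 to 6 that can be run without touching the network. The sample HTML string, the example.com URL, the short User-Agent value, and the helper name find_image_urls are assumptions made up for illustration; they are not taken from the real page or from the course code. The sketch also adds the User-Agent header before calling urlopen, because a header added after the request has already been sent has no effect.

```python
import re
import urllib.request

# Pattern from the question: the opening quote sits inside the capture group.
BUGGY = re.compile(r'<img class="BDE_Image" src=("[^"]*\.jpg)"')
# Pattern from step 5: the opening quote is matched outside the group.
FIXED = re.compile(r'<img class="BDE_Image" src="(.*?\.jpg)"')

# Hypothetical sample markup, not copied from the real Tieba page.
SAMPLE_HTML = '<img class="BDE_Image" src="http://example.com/forum/abc.jpg" width="560">'


def find_image_urls(html, pattern=FIXED):
    """Return every captured image URL found in html."""
    return pattern.findall(html)


def open_url(url):
    """Fetch a page as text; the header is set before the request is sent."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0')  # assumed short value
    with urllib.request.urlopen(req) as page:
        return page.read().decode('utf-8')


def get_img(html):
    """Download every matched image into the current directory."""
    for each in find_image_urls(html):
        filename = each.split('/')[-1]
        urllib.request.urlretrieve(each, filename)


if __name__ == '__main__':
    bad = find_image_urls(SAMPLE_HTML, BUGGY)[0]
    good = find_image_urls(SAMPLE_HTML, FIXED)[0]
    print(bad)   # "http://example.com/forum/abc.jpg  (note the leading quote)
    print(good)  # http://example.com/forum/abc.jpg
    # urllib takes everything before the first ':' as the scheme, so the
    # stray quote becomes part of it. No request is sent by these two lines.
    print(urllib.request.Request(bad).type)   # "http
    print(urllib.request.Request(good).type)  # http
```

To download for real, call get_img(open_url(url)) with the thread URL as in the original script. The scheme check at the end shows where the `"http` in the error message comes from: urllib has no handler registered for a scheme with that name, so it falls through to the unknown url type error.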

Ericwooooo0622 posted on 2020-6-7 19:23:48

suchocolate posted on 2020-6-7 10:25
1. The error says that line 17 hit an unknown URL type: urllib.error.URLError:

Thank you so much! Problem solved!