FishC Forum


[Solved] Python crawler keeps failing on Mac

Posted on 2020-7-11 20:54:45
10 fish coins
Last edited by ten$1 on 2020-7-11 22:21

First attempt:
import urllib.request
from bs4 import BeautifulSoup
import os

def Download(url,picAlt,name):
    # Windows-style path; on macOS backslashes are not path separators
    path = 'D:\\pythonD爬虫妹子图\\'+picAlt+'\\'
    if not os.path.exists(path):
        os.makedirs(path)
    urllib.request.urlretrieve( url, '{0}{1}.jpg'.format(path, name))

header = {
    "User-Agent":'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36',
    'Accept': '*/*',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive'
    }

def run(targetUrl, beginNUM ,endNUM):
    req = urllib.request.Request(url=targetUrl,headers=header)
    response = urllib.request.urlopen(req)
    html = response.read().decode('gb2312','ignore')
    soup = BeautifulSoup(html, 'html.parser')
    Divs = soup.find_all('div',attrs={'id':'big-pic' })
    nowpage = soup.find('span',attrs={'class':'nowpage'}).get_text()
    totalpage= soup.find('span',attrs={'class':'totalpage'}).get_text()
    if beginNUM ==endNUM :
        return
    for div in Divs:
        beginNUM = beginNUM+1

        if div.find("a") is None :
            print("No next image")
            return
        elif div.find("a")['href'] is None or div.find("a")['href']=="":
            print("No next image (empty link)")
            return
        print("Download info: overall progress:",beginNUM,"/",endNUM," , downloading gallery: (",nowpage,"/",totalpage,")")

        if int(nowpage)<int(totalpage):
            nextPageLink ="http://www.mmonly.cc/mmtp/qcmn/" +(div.find('a')['href'])
        elif int(nowpage)==int(totalpage):
            nextPageLink = (div.find('a')['href'])

        picLink = (div.find('a').find('img')['src'])
        picAlt = (div.find('a').find('img'))['alt']
        print('Image link:',picLink)
        print('Gallery name: [ ', picAlt , ' ] ')
        print('Starting download...........')
        Download(picLink,picAlt, nowpage)
        print("Download succeeded!")
        print('Next page link:',nextPageLink)
        run(nextPageLink,beginNUM ,endNUM)
        return


if __name__ == '__main__':
    targetUrl ="http://www.mmonly.cc/mmtp/qcmn/237269.html"
    run(targetUrl,beginNUM=0,endNUM=70)
    print(" OVER")

Result:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1240, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1286, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1235, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1006, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 946, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1409, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xiaojiayudeapple/Desktop/My Code/python/A.py", line 59, in <module>
    run(targetUrl,beginNUM=0,endNUM=70)
  File "/Users/xiaojiayudeapple/Desktop/My Code/python/A.py", line 21, in run
    response = urllib.request.urlopen(req)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>




                               

Second attempt:
import urllib.request
import os
import re

# Open a URL and return the raw response bytes
def url_open(url):
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.108 Safari/537.36',
               'Referer': 'http://www.mzitu.com'}
    req = urllib.request.Request(url,headers = headers)
    response = urllib.request.urlopen(req)
    html = response.read()
    return html

# Get the highest page number of the current image group
def get_maxpage(url):
    html = url_open(url).decode('utf-8')
    pages = re.findall(r'<span>\d{1,2}',html)
    # Drop the leading '<span>' (6 characters) from the last match
    return pages[-1][6:]

# Given a listing-page URL, return the list of image-group links on that page
def find_imgs(url):
    html = url_open(url).decode('utf-8')
    imgs_url = re.findall(r'http://www.mzitu.com/\d{6}',html)
    return imgs_url

# Given an image-group URL and a page number, return the first image URL found on that page
def find_img(url,page):
    html = url_open(url + '/' + str(page)).decode('utf-8')
    img_addrs = []

    a = html.find('img src=')
    while a != -1:
        b = html.find('.jpg" alt="',a,a+255)
        if b!= -1:
            img_addrs.append(html[a+9:b+4])
        else:
            b =a + 9
        a = html.find('img src=',b)
    return img_addrs[0]


# Save each image in img_addrs to the current directory (the caller has already changed into the target folder)
def save_img(folder,img_addrs):
    for each in img_addrs:
        filename = each.split('/')[-1]
        print(filename)
        with open(filename,'wb') as f:
            img = url_open(each)
            f.write(img)

def download(folder = 'meizi',*pages):
    if not os.path.exists(folder):
        os.mkdir(folder)
    os.chdir(folder)

    url = 'http://www.mzitu.com'
    # page_num = int(get_page(url))        # get the current page count

    for page in pages:
        page_url = url + '/page/' + str(page) + '/'
        # Create the per-page folder
        pagefolder = "page-" + str(page)
        if not os.path.exists(pagefolder):
            os.mkdir(pagefolder)
        os.chdir(pagefolder)
        # Get the list of image-group URLs
        img_group_addrs = find_imgs(page_url)
        # For each image group, collect the image URLs and save them
        group = 0
        for addr in img_group_addrs:
            group += 1
            img_addrs = [find_img(addr,x) for x in range(int(get_maxpage(addr)))]
            # Create the per-group folder
            groupfolder = str(page) + "-" + str(group)
            if not os.path.exists(groupfolder):
                os.mkdir(groupfolder)
            os.chdir(groupfolder)
            save_img(groupfolder,img_addrs)
            os.chdir(os.pardir)
        os.chdir(os.pardir)

if __name__ == '__main__':
    download('meizi',1)  # first argument: folder name; remaining arguments: page numbers to crawl

Result:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1350, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1240, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1286, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1235, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1006, in _send_output
    self.send(msg)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 946, in send
    self.connect()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/http/client.py", line 1409, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1040, in _create
    self.do_handshake()
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/xiaojiayudeapple/Desktop/My Code/python/A.py", line 83, in <module>
    download('meizi',1)  # first argument: folder name; remaining arguments: page numbers to crawl
  File "/Users/xiaojiayudeapple/Desktop/My Code/python/A.py", line 67, in download
    img_group_addrs = find_imgs(page_url)
  File "/Users/xiaojiayudeapple/Desktop/My Code/python/A.py", line 22, in find_imgs
    html = url_open(url).decode('utf-8')
  File "/Users/xiaojiayudeapple/Desktop/My Code/python/A.py", line 10, in url_open
    response = urllib.request.urlopen(req)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 222, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 531, in open
    response = meth(req, response)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 640, in http_response
    response = self.parent.error(
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 563, in error
    result = self._call_chain(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 755, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 525, in open
    response = self._open(req, data)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 542, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 502, in _call_chain
    result = func(*args)
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1393, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/urllib/request.py", line 1353, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1108)>
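
Both tracebacks end the same way: a `urllib.error.URLError` whose reason is an `ssl.SSLCertVerificationError`, raised after the `http_error_302` redirect handler passes the request on to `https_open`. A small sketch of how a script could recognise that specific failure rather than crash on it. The function name `is_cert_failure` is hypothetical, and the sample error is built by hand for illustration, not captured from either site.

```python
import ssl
import urllib.error


def is_cert_failure(exc):
    """True if exc is a URLError caused by failed certificate verification."""
    return (
        isinstance(exc, urllib.error.URLError)
        and isinstance(exc.reason, ssl.SSLCertVerificationError)
    )


# Hand-built errors with the same shape as the one in the tracebacks,
# plus an unrelated URLError for comparison.
cert_error = urllib.error.URLError(
    ssl.SSLCertVerificationError("certificate verify failed")
)
other_error = urllib.error.URLError("timed out")

print(is_cert_failure(cert_error))
print(is_cert_failure(other_error))
```

In a crawler this check would sit in an `except urllib.error.URLError as exc:` branch around `urlopen`, so that certificate problems can be reported separately from timeouts or DNS failures.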
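Best answer
2020-7-11 20:54:46
ten$1 wrote on 2020-7-11 22:23:
It's just the stock, factory-bundled IDE... How do I fix the certificate error?


If bs4 were installed, you shouldn't be getting a "no module named bs4" error. Do you have more than one version of Python on your computer? The install only counts as successful once you see "Successfully" in the output.

For the certificate error, try adding this code, which disables certificate verification:
import ssl
ssl._create_default_https_context = ssl._create_unverified_context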
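
The two-line patch above turns verification off for every HTTPS request the process makes. A narrower variant, sketched here under assumptions, builds an `ssl.SSLContext` and passes it to a single `urlopen` call through its `context` parameter. The helper names `make_context` and `fetch` are hypothetical and do not come from the thread. With the python.org installer for macOS, running the bundled `Install Certificates.command` is the usual way to make verification itself succeed, in which case `verify=False` should not be needed.

```python
import ssl
import urllib.request


def make_context(verify=True):
    """Return an SSLContext; verification stays on unless verify=False."""
    ctx = ssl.create_default_context()
    if not verify:
        # Order matters: check_hostname must be disabled before CERT_NONE.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, headers=None, verify=True, timeout=10):
    """Fetch url and return the body bytes, using a per-call SSL context."""
    req = urllib.request.Request(url, headers=headers or {})
    ctx = make_context(verify)
    with urllib.request.urlopen(req, context=ctx, timeout=timeout) as resp:
        return resp.read()
```

Calling `fetch(url, headers=header, verify=False)` would then skip verification for that one request only, leaving the rest of the program with the default, verified behaviour.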

Latest courses from Xiaojiayu (小甲鱼) -> https://ilovefishc.com

OP | Posted on 2020-7-11 20:55:45
What is going on here?

Posted on 2020-7-11 21:08:40
The first error looks like you haven't installed the BeautifulSoup module?

The second error is a certificate error, right? SSL.
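
One way to tell whether a missing-module error comes from installing into a different interpreter is to ask the running interpreter directly. A minimal sketch using only the standard library; the helper name `module_status` is hypothetical.

```python
import importlib.util
import sys


def module_status(name):
    """Report which interpreter is running and whether it can find `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "executable": sys.executable,        # interpreter running this code
        "version": sys.version.split()[0],   # e.g. "3.8.x"
        "found": spec is not None,           # can this interpreter import it?
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    print(module_status("bs4"))
```

If `found` is False, installing with the interpreter shown in `executable` (for example `python3 -m pip install beautifulsoup4`, assuming `python3` resolves to that same interpreter) puts the package where the script will look for it.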


OP | Posted on 2020-7-11 22:20:54
Twilight6 wrote on 2020-7-11 21:08:
The first error looks like you haven't installed the BeautifulSoup module?

The second error is a certificate error, right? SSL.

BeautifulSoup is installed.

Posted on 2020-7-11 22:22:09
ten$1 wrote on 2020-7-11 22:20:
BeautifulSoup is installed.

Are you using PyCharm?

OP | Posted on 2020-7-11 22:23:23
Twilight6 wrote on 2020-7-11 22:22:
Are you using PyCharm?

It's just the stock, factory-bundled IDE... How do I fix the certificate error?

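OP | Posted on 2020-7-11 22:28:26
Twilight6 wrote on 2020-7-11 22:26:
If bs4 were installed, you shouldn't be getting a "no module named bs4" error. Do you have more than one version of Python on your computer? When installing, you need to see Suc ...

It works now, thanks a lot!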