FishC Forum


Exceptions when scraping the Doutula site (could someone please tell a beginner what went wrong?)

Posted on 2020-6-5 17:18:00

import requests
import parsel
import re
import concurrent.futures





headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4043.400'}


def send_request(url):
    '''Request the data'''
    response = requests.get(url = url, headers = headers, verify = False)
    return response

def parse_data(data):
    '''Parse the data'''
    selector = parsel.Selector(data)
    result_list = selector.xpath('//a[@class="col-xs-6 col-sm-3"]')
    for result in result_list:
        title = result.xpath('./img/@data-original').extract_first()
        src_url = result.xpath('./img/@alt').extract_first()

        # Prepare the file extension
        all_title = title + '.' + src_url.split('.')[-1]
        yield all_title, src_url


def save_data(file_name, data):
    '''Save the data'''
    with open('img\\' + file_name, mode = 'wb') as f:
        f.write(data)
        print('保存完成:', file_name)  # "Saved:"



def main(page):
    '''Handle pagination'''
    for page in range(1,page + 1):
        print('============正在爬取第{}页数据============'.format(page))  # "Now scraping page {}"
        thread_pool = concurrent.futures.ThreadPoolExecutor(max_workers=3)
        res = send_request('https://www.doutula.com/photo/list/?page={}'.format(str(page)))
        src_url = parse_data(res.text)
        for file, url in src_url:
            image_response = send_request(url)
            thread_pool.submit(save_data, file, image_response.content)


        thread_pool.shutdown()










if __name__ == '__main__':
    main(10)





Some abnormal URLs:
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\python.exe E:/python_fruit/表情包/表情包_ronot-多线程.py
============正在爬取第1页数据============
C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\urllib3\connectionpool.py:986: InsecureRequestWarning: Unverified HTTPS request is being made to host 'www.doutula.com'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/e ... e.html#ssl-warnings
  InsecureRequestWarning,
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 380, in prepare_url
    scheme, auth, host, port, path, query, fragment = parse_url(url)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\urllib3\util\url.py", line 392, in parse_url
    return six.raise_from(LocationParseError(source_url), None)
  File "<string>", line 3, in raise_from
urllib3.exceptions.LocationParseError: Failed to parse: 人呢?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:/python_fruit/表情包/表情包_ronot-多线程.py", line 64, in <module>
    main(10)
  File "E:/python_fruit/表情包/表情包_ronot-多线程.py", line 48, in main
    image_response = send_request(url)
  File "E:/python_fruit/表情包/表情包_ronot-多线程.py", line 16, in send_request
    response = requests.get(url = url, headers = headers, verify = False)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 459, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\models.py", line 382, in prepare_url
    raise InvalidURL(*e.args)
requests.exceptions.InvalidURL: Failed to parse: 人呢?

Process finished with exit code 1

Posted on 2020-6-5 22:12:02
The URL extraction in parse_data is wrong. I haven't tested the rest yet.

Original poster | Posted on 2020-6-6 10:59:40
Stubborn posted on 2020-6-5 22:12
The URL extraction in parse_data is wrong. I haven't tested the rest yet.

How should I handle that?
Just pass?
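
On the "just pass?" question: one option, sketched here rather than taken from the thread, is to check each extracted value before requesting it and skip anything that is not an absolute http(s) URL. `looks_like_url` is a hypothetical helper name, and the sample values are made up for illustration; only the standard library's `urllib.parse` is used.

```python
from urllib.parse import urlsplit


def looks_like_url(value):
    """Return True only for strings that parse as absolute http(s) URLs."""
    if not isinstance(value, str):
        return False
    parts = urlsplit(value.strip())
    return parts.scheme in ('http', 'https') and bool(parts.netloc)


# Hypothetical sample values: an image caption, an image URL, a missing value.
candidates = ['人呢?', 'https://example.com/a.jpg', None]
valid = [c for c in candidates if looks_like_url(c)]
print(valid)
```

Skipping bad values only hides the symptom. If every value fails the check, the extraction rule itself is probably the thing to fix.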

Posted on 2020-6-6 11:50:05
风尘岁月 posted on 2020-6-6 10:59
How should I handle that?
Just pass?

The URL you extracted is wrong. Why not print it and see what you actually extracted? Then fix the extraction rule.

Posted on 2020-6-6 13:42:26
extract_first()
That is Scrapy usage

Original poster | Posted on 2020-6-6 17:41:51
兢兢 posted on 2020-6-6 13:42
extract_first()
That is Scrapy usage

I don't know how to use the Scrapy framework

Posted on 2020-6-6 22:14:40 from FishC Mobile
title = result.xpath('./img/@data-original')[0]
src_url = result.xpath('./img/@alt')[0]
Take the first item of the list