Async crawler: everything gets downloaded, but the run ends in a wall of red errors. What is the problem? (Python交流, 编程语言专区, 鱼C论坛)

鱼-wsyy posted on 2023-11-26 18:16:02

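My async crawler does download everything, but when it finishes the console is full of red error output. What is the problem?

The code is as follows: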
import asyncio
import aiohttp
import aiofile
import os


class HeroSkin:
    def __init__(self):
        self.json_url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js?ts=2834976'
        self.skin_url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js?ts=2834984'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
        }

    async def get_image_content(self, session, heroid):
        # Fetch the skin list of one hero, then download every skin image.
        async with session.get(self.skin_url.format(heroid), headers=self.headers) as response:
            result = await response.json(content_type=None)
            for skin in result['skins']:
                skin_name = skin['name'].replace(' ', '_')
                hero_title = skin['heroTitle']
                img_url = skin['mainImg']
                async with session.get(img_url) as skin_res:
                    content = await skin_res.read()
                    async with aiofile.async_open('./images/' + hero_title + '-' + skin_name + '.jpg', mode='wb') as f:
                        await f.write(content)
                        print('Saved:', skin_name)

    async def main(self):
        tasks = list()
        async with aiohttp.ClientSession() as session:
            # Fetch the hero list and start one task per hero.
            async with session.get(self.json_url, headers=self.headers) as response:
                result = await response.json(content_type=None)
                for item in result['hero']:
                    heroid = item['heroId']
                    coro_obj = self.get_image_content(session, heroid)
                    tasks.append(asyncio.create_task(coro_obj))
                await asyncio.wait(tasks)


if __name__ == '__main__':
    if not os.path.exists('./images'):
        os.mkdir('./images')

    hero_skin = HeroSkin()
    asyncio.run(hero_skin.main())

FishC_GPT posted on 2023-11-26 18:16:18

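In an async crawler, red error output usually means that exceptions were raised, and the error messages tell you what exactly went wrong. Common causes include network connection errors, requests blocked by the site's anti-crawling measures, and file paths that do not exist.

First, print the exception messages so the problem can be analyzed further. In the `get_image_content` method, wrap the code starting at `await response.json(content_type = None)` in a `try...except` block that catches the exception and prints its message. The modified code looks like this: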

async def get_image_content(self, session, heroid):
    async with session.get(self.skin_url.format(heroid), headers=self.headers) as response:
        try:
            result = await response.json(content_type=None)
            for skin in result['skins']:
                # rest of the loop body omitted
                pass
        except Exception as e:
            # Report the failure instead of letting the task die silently.
            print('Exception:', str(e))

Run the program, read the error messages printed to the console, and work on the problem based on what they say.
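
One caveat with printing `str(e)`: some exceptions, such as `asyncio.TimeoutError`, carry no message, so the printed line can come out empty. The sketch below is only an illustration of that point; `describe_exception` and `failing_step` are hypothetical names, and the failing step stands in for a real request.

```python
import asyncio


def describe_exception(exc: BaseException) -> str:
    """Return a description that stays informative when str(exc) is empty."""
    # repr() includes the exception type, e.g. "TimeoutError()".
    return repr(exc)


async def failing_step() -> None:
    # Hypothetical stand-in for a request that times out.
    raise asyncio.TimeoutError()


async def demo() -> str:
    try:
        await failing_step()
    except Exception as e:
        return describe_exception(e)
    return "no exception"


message = asyncio.run(demo())
print("Exception:", message)
```

Printing `repr(e)` (or logging with a traceback) costs nothing extra and makes it clear which kind of error occurred.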

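If it is a network connection error, check that your network is working properly, or try sending the requests through a proxy.

If the requests are being blocked by the site's anti-crawling measures, try adding more fields to the request headers to imitate a browser, for example the `Referer` header.

If the problem is a file path that does not exist, first check that the path is correct, then make sure the directory it points to exists. You can add a check at the start of the program that creates the directory when needed, as shown here: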

if __name__ == '__main__':
    if not os.path.exists('./images'):
        os.mkdir('./images')

    hero_skin = HeroSkin()
    asyncio.run(hero_skin.main())

This guarantees that the target directory `./images` exists; if it does not, it is created automatically.
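
As a possible alternative to the check-then-create pattern, `os.makedirs` with `exist_ok=True` does both steps in one call and also creates missing parent directories. This is a minimal sketch; `ensure_dir` is a hypothetical helper, and a temporary directory is used only so the example does not touch the working directory.

```python
import os
import tempfile


def ensure_dir(path: str) -> str:
    """Create the directory (and any missing parents) if it does not exist yet."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstration inside a throwaway temporary directory.
base = tempfile.mkdtemp()
target = ensure_dir(os.path.join(base, "images"))
ensure_dir(target)  # calling it again is harmless
print(os.path.isdir(target))
```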

The reply above comes from ChatGPT (official FishC interface). If it did not answer your question correctly, please follow up.

鱼-wsyy posted on 2023-11-26 18:38:57

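Exception: Server disconnected
Exception: No such file or directory: './images/阿卡丽-K/DA_阿卡丽.jpg'
Exception: The semaphore timeout period has expired
Exception:

Set the string (file name) exception aside, I had not noticed that one. Are the remaining ones caused by network problems and too much concurrency?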

FishC_GPT posted on 2023-11-26 18:39:12

鱼-wsyy posted on 2023-11-26 18:38
Exception: Server disconnected
Exception: No suc ...

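Based on the error messages you posted, the following can be concluded:

1. "Server disconnected": this error is usually caused by a connection problem; the server may have closed the connection, or the network may be unstable. Check that your network connection is working, and try a proxy or a more stable network.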

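2. "No such file or directory: './images/阿卡丽-K/DA_阿卡丽.jpg'": this error occurs because the directory `./images/阿卡丽-K` does not exist. The `/` in the skin name `K/DA_阿卡丽` is interpreted as a path separator, so either remove such characters from the file name or make sure the corresponding directory exists before saving the image.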

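3. "The semaphore timeout period has expired": this is a timeout error reported by Windows for an I/O operation, most likely because the network connection is slow or a request took too long to respond. You can try increasing the wait time, for example through the `timeout` parameter, and reducing the number of simultaneous requests.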

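4. "Ignore the string exception": this is your own remark rather than an error message. The fourth line of output, however, is an exception with an empty message, which means `str(e)` returned an empty string. Check the string handling in the related code, and print `repr(e)` so that the exception type is visible.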
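
For the timeouts in item 3, one option is to give each request its own time limit and retry a few times before giving up. The following is a sketch under stated assumptions, not the thread's code: `with_retries` and `fake_fetch` are hypothetical names, the limits are arbitrary, and `fake_fetch` simulates a request that is too slow on the first attempt only.

```python
import asyncio


async def with_retries(make_coro, attempts=3, timeout=0.05, delay=0.01):
    """Await make_coro() with a per-attempt timeout, retrying when it times out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            # wait_for cancels the inner coroutine once the timeout expires.
            return await asyncio.wait_for(make_coro(), timeout=timeout)
        except asyncio.TimeoutError as e:
            last_error = e
            # Back off a little longer after each failed attempt.
            await asyncio.sleep(delay * attempt)
    raise last_error


calls = {"count": 0}


async def fake_fetch():
    # Simulated request: the first call is too slow, later calls are fast.
    calls["count"] += 1
    if calls["count"] == 1:
        await asyncio.sleep(1)
    return b"data"


result = asyncio.run(with_retries(fake_fetch))
print(result, "after", calls["count"], "attempts")
```

A new coroutine has to be created for every attempt, which is why the helper takes a factory (`make_coro`) instead of a coroutine object.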

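As for the concurrency being too high: `asyncio.wait` has no option that limits concurrency; it only waits for tasks that have already been started. To reduce the number of simultaneous requests, start the tasks in smaller batches, or use a tool that caps concurrency, such as `asyncio.Semaphore`.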
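
A minimal sketch of the `asyncio.Semaphore` approach, assuming a limit of 5 simultaneous downloads. `bounded_gather` and `fake_download` are hypothetical names, and the download is simulated with `asyncio.sleep` so the example runs without network access.

```python
import asyncio
from functools import partial


async def bounded_gather(coro_factories, limit):
    """Run all coroutines, but never more than `limit` at the same time."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(factory):
        # Only `limit` tasks can be inside this block at once.
        async with semaphore:
            return await factory()

    return await asyncio.gather(*(run_one(f) for f in coro_factories))


state = {"running": 0, "peak": 0}


async def fake_download(hero_id):
    # Track how many simulated downloads overlap.
    state["running"] += 1
    state["peak"] = max(state["peak"], state["running"])
    await asyncio.sleep(0.01)
    state["running"] -= 1
    return hero_id


factories = [partial(fake_download, i) for i in range(20)]
results = asyncio.run(bounded_gather(factories, limit=5))
print("peak concurrency:", state["peak"])
```

In the crawler, the same `async with semaphore:` guard could wrap each `session.get(...)` call, with the semaphore created once inside `main` and shared by all tasks.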

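In short, work through the points above: keep the network connection stable, make sure the target directory exists, choose a suitable timeout, and handle errors in the string (file name) processing correctly. That should resolve the problems you are seeing.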


isdkz posted on 2023-11-26 18:50:16

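In async code, an exception in one coroutine does not stop the program. `asyncio.wait` returns without re-raising it, so the exceptions that nobody retrieved are all reported together when the program ends.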
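
One way to see these failures while the program is still running, instead of in the final error dump, is to collect them with `asyncio.gather(..., return_exceptions=True)`. This is a minimal sketch with a hypothetical `job` coroutine that fails for odd inputs; it is not the crawler itself.

```python
import asyncio


async def job(n):
    await asyncio.sleep(0)
    if n % 2:
        raise ValueError(f"job {n} failed")
    return n


async def main():
    tasks = [asyncio.create_task(job(n)) for n in range(4)]
    # Exceptions are returned as values, in task order, instead of being raised.
    results = await asyncio.gather(*tasks, return_exceptions=True)
    ok = [r for r in results if not isinstance(r, Exception)]
    failed = [r for r in results if isinstance(r, Exception)]
    return ok, failed


ok, failed = asyncio.run(main())
print("succeeded:", ok)
print("failed:", [repr(e) for e in failed])
```

Because every exception is retrieved here, nothing is left over to be reported at shutdown.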

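In your program there are only two causes of exceptions in the coroutines:

1. Some hero or skin names contain characters that are illegal in Windows file names.

2. For some entries (maybe there is no image, I am not sure), the skin['mainImg'] field is empty, which produces an invalid URL error.

Here is your program with the fixes applied: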
import asyncio
import aiohttp
import aiofile
import os
import re


class HeroSkin:
    def __init__(self):
        self.json_url = 'https://game.gtimg.cn/images/lol/act/img/js/heroList/hero_list.js?ts=2834976'
        self.skin_url = 'https://game.gtimg.cn/images/lol/act/img/js/hero/{}.js?ts=2834984'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36'
        }

    async def get_image_content(self, session, heroid):
        async with session.get(self.skin_url.format(heroid), headers=self.headers) as response:
            result = await response.json(content_type=None)
            for skin in result['skins']:
                skin_name = skin['name'].replace(' ', '_')
                hero_title = skin['heroTitle']
                img_url = skin['mainImg']
                # Skip entries whose image URL is empty.
                if img_url:
                    async with session.get(img_url) as skin_res:
                        content = await skin_res.read()
                        filename = hero_title + '-' + skin_name + '.jpg'
                        # Regex pattern for characters that are illegal in file names
                        pattern = r'[\\/:*?"<>|]'

                        # Use re.sub() to replace the illegal characters with an empty string
                        new_filename = re.sub(pattern, "", filename)

                        async with aiofile.async_open(os.path.join('images', new_filename), mode='wb') as f:
                            await f.write(content)
                            print('Saved:', skin_name)

    async def main(self):
        tasks = list()
        async with aiohttp.ClientSession() as session:
            async with session.get(self.json_url, headers=self.headers) as response:
                result = await response.json(content_type=None)
                for item in result['hero']:
                    heroid = item['heroId']
                    coro_obj = self.get_image_content(session, heroid)
                    tasks.append(asyncio.create_task(coro_obj))
                await asyncio.wait(tasks)


if __name__ == '__main__':
    if not os.path.exists('./images'):
        os.mkdir('./images')

    hero_skin = HeroSkin()
    asyncio.run(hero_skin.main())
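
To check the file name cleanup in isolation, the same regular expression can be tried on the name from the error message earlier in the thread. In this sketch `safe_filename` is a hypothetical helper, and the inputs `阿卡丽` and `K/DA 阿卡丽` are inferred from the failing path `./images/阿卡丽-K/DA_阿卡丽.jpg`, not taken from the API response.

```python
import re

# Characters that Windows does not allow in file names.
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(hero_title: str, skin_name: str, ext: str = ".jpg") -> str:
    """Build the image file name and drop characters that are illegal on Windows."""
    raw = f"{hero_title}-{skin_name.replace(' ', '_')}{ext}"
    return ILLEGAL_CHARS.sub("", raw)


name = safe_filename("阿卡丽", "K/DA 阿卡丽")
print(name)
```

With the slash removed, the name no longer points into a subdirectory that does not exist.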
