qingzhaobing posted on 2020-5-3 16:09:25

A question about using ImagesPipeline in the Scrapy framework. Could someone please take a look?

Last edited by qingzhaobing on 2020-5-3 16:18

I followed a tutorial to write a spider that scrapes BMW 5 Series pictures from Autohome (汽车之家).
items.py is defined as follows:
import scrapy
class BmwItem(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py is configured as follows:
import os

BOT_NAME = 'bmw'

SPIDER_MODULES = ['bmw.spiders']
NEWSPIDER_MODULE = 'bmw.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
   'Accept-Language': 'en',
   'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36'

}
ITEM_PIPELINES = {
   # 'bmw.pipelines.BmwPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)),'images')
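One thing worth verifying with this configuration: the Scrapy docs state that the built-in ImagesPipeline requires Pillow, and the pipeline will not run without it. A quick sketch for checking the current environment:

```python
def pillow_available():
    """Return True if Pillow (required by Scrapy's ImagesPipeline) imports.

    Scrapy's image handling needs Pillow, so this is worth verifying
    before debugging the pipeline configuration itself.
    """
    try:
        import PIL  # Pillow installs under the 'PIL' package name
        return True
    except ImportError:
        return False

print('Pillow available:', pillow_available())
```

If this prints False, `pip install Pillow` in the same environment that runs `scrapy crawl`.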

The spider code is as follows:
import scrapy
from bmw.items import BmwItem

class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//img/@src').getall()
            # map() passes each entry of urls through the lambda, joining the
            # relative src against the response URL. Note that map() returns
            # an iterator, so wrap it in list() to get a list back.
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, image_urls=urls)
            yield item
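The map/urljoin step above can be checked against the standard library, since `response.urljoin(url)` joins `url` against `response.url`. The image paths below are made-up examples of the two relative forms a page like this typically serves:

```python
from urllib.parse import urljoin

# response.urljoin(url) is equivalent to urljoin(response.url, url).
base = 'https://car.autohome.com.cn/pic/series/65.html'
srcs = [
    '//example.autoimg.cn/cardfs/product/example_1.jpg',  # protocol-relative
    '/pic/series/65-1.html',                              # root-relative
]
urls = list(map(lambda url: urljoin(base, url), srcs))
print(urls)
# The protocol-relative src inherits the https scheme; the root-relative
# path is resolved against the site root.
```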
            

pipelines.py is configured as follows:
import os
import urllib.request

class BmwPipeline:
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['image_urls']

        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)

        for url in urls:
            image_name = url.split('_')[-1]
            urllib.request.urlretrieve(url, os.path.join(category_path, image_name))

        return item

After running scrapy crawl bmw5, the log output is as follows:
2020-05-03 16:03:48 INFO: Scrapy 2.1.0 started (bot: bmw)
2020-05-03 16:03:48 INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29), pyOpenSSL 19.1.0 (OpenSSL 1.1.1g  21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.16299-SP0
2020-05-03 16:03:48 DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-03 16:03:48 INFO: Overridden settings:
{'BOT_NAME': 'bmw',
'NEWSPIDER_MODULE': 'bmw.spiders',
'SPIDER_MODULES': ['bmw.spiders']}
2020-05-03 16:03:48 INFO: Telnet Password: b0c17134dd3780b7
2020-05-03 16:03:48 INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-05-03 16:03:49 INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 16:03:49 INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']

Later I changed pipelines.py to the following (I changed this myself based on a hunch; it differs from the tutorial):
import os
import urllib.request
# Note: the Scrapy class is named ImagesPipeline (singular), not ImagesPipelines
from scrapy.pipelines.images import ImagesPipeline

class BmwImagesPipeline:  # renamed so it does not shadow the imported ImagesPipeline
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['image_urls']

        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)

        for url in urls:
            image_name = url.split('_')[-1]
            urllib.request.urlretrieve(url, os.path.join(category_path, image_name))

        return item
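Note that a class defined in pipelines.py only runs when it is listed in ITEM_PIPELINES; as posted, settings.py still points at the built-in 'scrapy.pipelines.images.ImagesPipeline', so this file is never imported at all. Enabling a local class would look something like this (CustomImagesPipeline is a placeholder for whatever the class in bmw/pipelines.py is actually named):

```python
# settings.py -- point ITEM_PIPELINES at the local class instead of the
# built-in one. 'CustomImagesPipeline' is a placeholder name.
ITEM_PIPELINES = {
    'bmw.pipelines.CustomImagesPipeline': 1,
}
```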

The result is the same: the pipeline is still not enabled.

If I understand it correctly, not even the item's fields are getting passed through.

But the tutorial configures it exactly this way, without overriding get_media_requests(self, item, info) or item_completed(self, results, item, info) in the ImagesPipeline class.

Why can the tutorial download images with the default ImagesPipeline, while my identical code can't?

Any pointers would be greatly appreciated. Thank you!!

qingzhaobing posted on 2020-5-3 16:59:55

Please, could some expert take a look and save a helpless newbie?

qingzhaobing posted on 2020-5-3 19:00:56

Could someone please help? It's urgent!!

qingzhaobing posted on 2020-5-3 23:47:57

Please help... still hoping someone will take a look.

qingzhaobing posted on 2020-5-4 16:50:51

Is nobody around...?