I followed a tutorial to write a Scrapy spider that scrapes images of the BMW 5 Series from Autohome (汽车之家).
items.py is set up as follows:
import scrapy


class BmwItem(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
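For context, from the Scrapy docs rather than from the tutorial: the built-in ImagesPipeline reads download URLs from the image_urls field and writes its results back into the images field by default, which is why BmwItem declares both. Those field names are configurable in settings.py:

# Defaults shown; only needed if different field names are wanted.
IMAGES_URLS_FIELD = 'image_urls'
IMAGES_RESULT_FIELD = 'images'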
settings.py is set up as follows:
import os

BOT_NAME = 'bmw'

SPIDER_MODULES = ['bmw.spiders']
NEWSPIDER_MODULE = 'bmw.spiders'

ROBOTSTXT_OBEY = False

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
}

ITEM_PIPELINES = {
    # 'bmw.pipelines.BmwPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}

IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
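One thing worth double-checking here, an assumption on my part and not something from the tutorial: the built-in ImagesPipeline requires Pillow, and as far as I know Scrapy skips the pipeline at startup if Pillow cannot be imported. A quick standalone check:

# Sanity check for the Pillow dependency of scrapy.pipelines.images.ImagesPipeline.
try:
    from PIL import Image  # provided by the Pillow package
    print('Pillow is available')
except ImportError:
    print('Pillow is missing: pip install Pillow')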
The spider code is as follows:
import scrapy
from bmw.items import BmwItem


class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//img/@src').getall()
            # for url in urls:
            #     url = response.urljoin(url)
            # map: pass each entry of urls through the lambda to build the full
            # URL, producing a new sequence in the same order.
            # Note: map returns an iterator, so it still has to be converted
            # into a list with list().
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, image_urls=urls)
            yield item
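As an aside, response.urljoin(url) is essentially urllib.parse.urljoin(response.url, url) (for HTML responses Scrapy also honors a <base> tag), so the map/lambda line could equally be written as a list comprehension. A minimal standalone sketch with a made-up src value:

from urllib.parse import urljoin

base = 'https://car.autohome.com.cn/pic/series/65.html'
srcs = ['//car3.autoimg.cn/cardfs/product/g1/sample_t.jpg']  # hypothetical @src value
# Protocol-relative URLs pick up the scheme of the base URL.
full_urls = [urljoin(base, src) for src in srcs]
print(full_urls)  # ['https://car3.autoimg.cn/cardfs/product/g1/sample_t.jpg']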
pipelines.py is set up as follows:
import os
import urllib.request


class BmwPipeline:
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['image_urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            urllib.request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
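Note that with the settings.py shown above this BmwPipeline never runs, because its ITEM_PIPELINES entry is commented out in favor of the built-in ImagesPipeline. Enabling it instead would look like:

ITEM_PIPELINES = {
    'bmw.pipelines.BmwPipeline': 300,
}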
The log output after running scrapy crawl bmw5 is as follows:
2020-05-03 16:03:48 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: bmw)
2020-05-03 16:03:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.16299-SP0
2020-05-03 16:03:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-03 16:03:48 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'bmw',
 'NEWSPIDER_MODULE': 'bmw.spiders',
 'SPIDER_MODULES': ['bmw.spiders']}
2020-05-03 16:03:48 [scrapy.extensions.telnet] INFO: Telnet Password: b0c17134dd3780b7
2020-05-03 16:03:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2020-05-03 16:03:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 16:03:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
Later I changed pipelines.py to the following (these changes were my own guess and differ from the tutorial):
import os
import urllib.request
from scrapy.pipelines.images import ImagesPipelines


class ImagesPipelines:
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['image_urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            urllib.request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
The result after running is still the same: the pipeline is not enabled.
If I'm not misunderstanding the output, the item's fields aren't even being passed through.
But the tutorial uses this exact setup too, without overriding get_media_requests(self, item, info) or item_completed(self, results, item, info) in the ImagesPipeline class.
Why does the default ImagesPipeline download images fine in the tutorial, while my identical code doesn't..
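For reference, one thing I noticed while writing this up: the built-in class is named ImagesPipeline (singular), so the import in my modified pipelines.py would raise an ImportError if that module were ever loaded. My understanding is that the usual pattern for per-category folders is to subclass ImagesPipeline and override get_media_requests plus file_path, roughly like this (a sketch with a hypothetical class name, not the tutorial's code):

import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline  # singular


class CategoryImagesPipeline(ImagesPipeline):  # hypothetical name
    # Attach the item's category to each download request via meta.
    def get_media_requests(self, item, info):
        for url in item['image_urls']:
            yield scrapy.Request(url, meta={'category': item['category']})

    # Save each image under IMAGES_STORE/<category>/<original file name>.
    def file_path(self, request, response=None, info=None):
        category = request.meta['category']
        image_name = request.url.split('_')[-1]
        return os.path.join(category, image_name)

It would then be registered in settings.py as 'bmw.pipelines.CategoryImagesPipeline': 1 in place of the built-in entry.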
Any pointers would be greatly appreciated, many thanks!!!