I followed a tutorial and wrote a spider that scrapes pictures of the BMW 5 Series from Autohome (汽车之家).
items.py is set up as follows:

import scrapy

class BmwItem(scrapy.Item):
    category = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
settings.py is set up as follows:

import os

BOT_NAME = 'bmw'
SPIDER_MODULES = ['bmw.spiders']
NEWSPIDER_MODULE = 'bmw.spiders'
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36',
}
ITEM_PIPELINES = {
    # 'bmw.pipelines.BmwPipeline': 300,
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
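As I understand it, that IMAGES_STORE expression walks two directory levels up from settings.py, so the images land in an images/ folder next to the bmw/ package. A quick standalone check of my reading (the path here is just an assumed layout, not my real one):

import os

# Assumed layout: settings.py sits at <project>\bmw\settings.py, so two
# dirname() calls climb from the file up to the project root.
settings_file = r'C:\projects\bmw\bmw\settings.py'  # hypothetical path
images_store = os.path.join(os.path.dirname(os.path.dirname(settings_file)), 'images')
print(images_store)  # -> C:\projects\bmw\images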
The spider code is as follows:

import scrapy
from bmw.items import BmwItem

class Bmw5Spider(scrapy.Spider):
    name = 'bmw5'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        uiboxs = response.xpath('//div[@class="uibox"]')[1:]  # [1:] skips the first matched div
        for uibox in uiboxs:
            category = uibox.xpath('.//div[@class="uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//img/@src').getall()
            # for url in urls:
            #     url = response.urljoin(url)
            # map(): pass each entry of urls through the lambda to build the
            # absolute url, producing a new sequence in the same order.
            # Note: map() returns an iterator object, so it still has to be
            # converted to a list with list().
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BmwItem(category=category, image_urls=urls)
            yield item
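A quick standalone illustration of that map() note (the URLs are made up; response.urljoin on a protocol-relative //... path just prepends the scheme, which I emulate here with plain string concatenation):

# map() returns a lazy map object, not a list, so it has to be
# materialized with list() before it goes into image_urls.
urls = ['//img1.example.com/a_800x600_1.jpg', '//img2.example.com/b_800x600_2.jpg']
mapped = map(lambda url: 'https:' + url, urls)
print(mapped)        # <map object at 0x...>
print(list(mapped))  # ['https://img1.example.com/a_800x600_1.jpg', ...]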
pipelines.py is set up as follows:

import os
import urllib.request

class BmwPipeline:
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['image_urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            urllib.request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
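In passing: if I'm not mistaken, each exists-check plus os.mkdir pair above could be collapsed into a single os.makedirs call, e.g.:

import os

# exist_ok=True makes this a no-op when the folder already exists,
# and any missing parent directories are created as well.
os.makedirs(os.path.join('images', 'some-category'), exist_ok=True)

but I kept the tutorial's version as-is.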
After running scrapy crawl bmw5, the log output is as follows:

2020-05-03 16:03:48 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: bmw)
2020-05-03 16:03:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 22:45:29) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-10-10.0.16299-SP0
2020-05-03 16:03:48 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-03 16:03:48 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'bmw',
'NEWSPIDER_MODULE': 'bmw.spiders',
'SPIDER_MODULES': ['bmw.spiders']}
2020-05-03 16:03:48 [scrapy.extensions.telnet] INFO: Telnet Password: b0c17134dd3780b7
2020-05-03 16:03:48 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-05-03 16:03:49 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 16:03:49 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
Later I changed pipelines.py to the following (this part I rewrote myself by feel; it is not the same as the tutorial):

import os
import urllib.request
from scrapy.pipelines.images import ImagesPipelines

class ImagesPipelines:
    def __init__(self):
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    def process_item(self, item, spider):
        category = item['category']
        urls = item['image_urls']
        category_path = os.path.join(self.path, category)
        if not os.path.exists(category_path):
            os.mkdir(category_path)
        for url in urls:
            image_name = url.split('_')[-1]
            urllib.request.urlretrieve(url, os.path.join(category_path, image_name))
        return item
The result was still the same: the pipeline was never enabled.
If I understand it correctly, the run didn't even get as far as passing the item's fields along.
But the tutorial configures it exactly this way, without overriding get_media_requests(self, item, info) or item_completed(self, results, item, info) in the ImagesPipeline class.
Why can the tutorial download images with the default ImagesPipeline, while my identical code can't?
Any pointers would be greatly appreciated, many thanks!!!
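For reference, this is roughly how I picture a subclass looking if one did override get_media_requests() and, instead of item_completed(), the file_path() hook that controls where each image is stored, so images get sorted into category folders. A minimal sketch based on my reading of the Scrapy docs, not code from the tutorial, and CategoryImagesPipeline is just a name I made up:

import os
from scrapy import Request
from scrapy.pipelines.images import ImagesPipeline

class CategoryImagesPipeline(ImagesPipeline):  # hypothetical name
    def get_media_requests(self, item, info):
        # One download request per image URL; the category rides along
        # in request.meta so file_path() can read it back later.
        for url in item['image_urls']:
            yield Request(url, meta={'category': item['category']})

    def file_path(self, request, response=None, info=None):
        # Save each image as <IMAGES_STORE>/<category>/<name>, reusing
        # the url.split('_')[-1] naming from my pipeline above.
        category = request.meta['category']
        image_name = request.url.split('_')[-1]
        return os.path.join(category, image_name)

It would then be enabled in settings.py with 'bmw.pipelines.CategoryImagesPipeline': 1 in place of the built-in entry.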