After recently working through 小甲鱼's OOXX lesson I have been wanting to put it into practice, but that lesson doesn't use the Scrapy framework, and I want to do the crawling with Scrapy. So I followed chapter 63 and tried this image-crawling exercise.
However, the crawl never succeeds. I hope the more experienced members here can give me some guidance.
Output after running the spider:
python@ubuntu:~/tupian$ scrapy crawl tupian
2016-07-24 02:00:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: tupian)
2016-07-24 02:00:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-07-24 02:00:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tupian.spiders', 'SPIDER_MODULES': ['tupian.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'tupian'}
2016-07-24 02:00:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-24 02:00:38 [boto] DEBUG: Retrieving credentials from metadata server.
2016-07-24 02:00:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-07-24 02:00:39 [boto] ERROR: Unable to read instance data, giving up
2016-07-24 02:00:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-24 02:00:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-24 02:00:39 [scrapy] INFO: Enabled item pipelines: TupianPipeline
2016-07-24 02:00:39 [scrapy] INFO: Spider opened
2016-07-24 02:00:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-24 02:00:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-24 02:00:39 [scrapy] INFO: Closing spider (finished)
2016-07-24 02:00:39 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 629425),
'log_count/DEBUG': 2,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 626707)}
2016-07-24 02:00:39 [scrapy] INFO: Spider closed (finished)
My code is as follows:
tupianspider.py:
#coding:utf-8
import scrapy
from tupian.items import TupianItem
from scrapy.crawler import CrawlerProcess

class tupianSpidier(scrapy.Spider):
    name = 'tupian'  # unique name of this spider
    allowed_domains = ["toumiaola.com"]  # the images live under this domain
    star_urls = ["http://www.toumiaola.com/youtu/"]  # starting URL for the images

    def parse_img(self, response):
        item = TupianItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image URLs
        print 'image_urls', item['image_urls']
        yield item
        new_url = ('http://www.toumiaola.com/youtu/' + [response.xpath('//a[@href="list_/w"]')]).extract_first()  # look for the next page
        print 'new_url', new_url
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy

class TupianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()  # image URLs
pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import urllib
from tupian import settings

class TupianPipeline(object):
    def process_item(self, item, spider):
        dir_path = '%s%s' % (settings.IMAGES_STORE, spider.name)
        print 'dir_path', dir_path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            list_name = image_url.split('/')
            file_path = list_name[len(list_name) - 1]  # image file name
            #print 'filename', file_path
            if os.path.exists(list_name):
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image
                file_writer.write(conn.read())
            file_writer.close()
        return item
settings.py:
BOT_NAME = 'tupian'
SPIDER_MODULES = ['tupian.spiders']
NEWSPIDER_MODULE = 'tupian.spiders'
ITEM_PIPELINES = {
    'tupian.pipelines.TupianPipeline': 1,
}
IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25
I'm sincerely asking for advice and hope someone can help me figure this out.
竞技山 posted on 2016-7-25 01:05
python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 [scrapy] INFO: Scrapy 1.0. ...
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,
That's one 302 redirect and three 503 responses: jiandan has detected that you are a crawler and is refusing your requests.
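If you still want to try, one common workaround is to send a browser-like User-Agent and slow the crawl down. This is only a sketch and is not guaranteed to work, since the site may still block you some other way; the header string below is just an example value, not something special about jiandan:

# settings.py -- illustrative values only
USER_AGENT = ('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36')  # pretend to be a normal browser
DOWNLOAD_DELAY = 2        # wait longer between requests so you hit the site less aggressively
COOKIES_ENABLED = False   # optional; some sites use cookies to spot crawlers

You can also set the header on a single request when you yield it:

yield scrapy.Request(new_url, callback=self.parse,
                     headers={'User-Agent': 'Mozilla/5.0 ...'})

Whether that is enough depends on the site; if it still returns 503, it is detecting you through something other than the headers.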