Crawler gurus, please help (two days of searching Baidu and it's still not solved)
After recently working through 小甲鱼's OOXX lesson I've been wanting to put it into practice, but that lesson doesn't use the Scrapy framework, and I want to do the crawl with Scrapy. So I followed chapter 63 and tried an image-crawling exercise, but the crawl never succeeds. I'd appreciate it if the more experienced members could point me in the right direction.
Output after running:
python@ubuntu:~/tupian$ scrapy crawl tupian
2016-07-24 02:00:38 INFO: Scrapy 1.0.3 started (bot: tupian)
2016-07-24 02:00:38 INFO: Optional features available: ssl, http11, boto
2016-07-24 02:00:38 INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tupian.spiders', 'SPIDER_MODULES': ['tupian.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'tupian'}
2016-07-24 02:00:38 INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-24 02:00:38 DEBUG: Retrieving credentials from metadata server.
2016-07-24 02:00:39 ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-07-24 02:00:39 ERROR: Unable to read instance data, giving up
2016-07-24 02:00:39 INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-24 02:00:39 INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-24 02:00:39 INFO: Enabled item pipelines: TupianPipeline
2016-07-24 02:00:39 INFO: Spider opened
2016-07-24 02:00:39 INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-24 02:00:39 DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-24 02:00:39 INFO: Closing spider (finished)
2016-07-24 02:00:39 INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 629425),
'log_count/DEBUG': 2,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 626707)}
2016-07-24 02:00:39 INFO: Spider closed (finished)
My code is as follows:
tupianspider.py:
#coding:utf-8
import scrapy
from tupian.items import TupianItem
from scrapy.crawler import CrawlerProcess


class tupianSpidier(scrapy.Spider):
    name = 'tupian'  # unique name that identifies this spider
    allowed_domains = ["toumiaola.com"]  # the pictures live under this domain
    star_urls = ["http://www.toumiaola.com/youtu/"]  # starting page for the pictures

    def parse_img(self, response):
        item = TupianItem()
        item['image_urls'] = response.xpath('//img//@src').extract()  # extract the image addresses
        print 'image_urls', item['image_urls']
        yield item
        new_url = 'http://www.toumiaola.com/youtu/' + response.xpath('//a[contains(text(), "下一页")]/@href').extract_first()  # find the next page
        print 'new_url', new_url
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TupianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()  # image addresses
pipelines.py:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import os
import urllib

from tupian import settings


class TupianPipeline(object):
    def process_item(self, item, spider):
        dir_path = '%s%s' % (settings.IMAGES_STORE, spider.name)
        print 'dir_path', dir_path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            list_name = image_url.split('/')
            file_path = '%s/%s' % (dir_path, list_name[-1])  # picture file name
            # print 'filename', file_path
            if os.path.exists(file_path):
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the picture
                file_writer.write(conn.read())
        return item
settings.py:
BOT_NAME = 'tupian'

SPIDER_MODULES = ['tupian.spiders']
NEWSPIDER_MODULE = 'tupian.spiders'

ITEM_PIPELINES = {
    'tupian.pipelines.TupianPipeline': 1,
}

IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25
I'm asking sincerely and hope someone can help me figure this out. Sigh, at this stage every single step feels like an abyss; I'll keep working my way up.

Once I've learned it myself, I'll come back and answer for you.
弧矢七 posted on 2016-7-24 10:06:
Once I've learned it myself, I'll come back and answer for you.
All right, looking forward to it~

The target address also looks correct to me, so why does the crawl just end right away?

python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 INFO: Scrapy 1.0.3 started (bot: jiandan)
2016-07-25 00:02:33 INFO: Optional features available: ssl, http11, boto
2016-07-25 00:02:33 INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jiandan.spiders', 'SPIDER_MODULES': ['jiandan.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'jiandan'}
2016-07-25 00:02:33 INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-25 00:02:33 DEBUG: Retrieving credentials from metadata server.
2016-07-25 00:02:34 ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-07-25 00:02:34 ERROR: Unable to read instance data, giving up
2016-07-25 00:02:34 INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-25 00:02:34 INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-25 00:02:34 INFO: Enabled item pipelines: JiandanPipeline
2016-07-25 00:02:34 INFO: Spider opened
2016-07-25 00:02:34 INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-25 00:02:34 DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-25 00:02:34 DEBUG: Redirecting (302) to <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> from <GET http://jandan.net/ooxx>
2016-07-25 00:02:34 DEBUG: Retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 1 times): 503 Service Unavailable
2016-07-25 00:02:35 DEBUG: Retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 2 times): 503 Service Unavailable
2016-07-25 00:02:35 DEBUG: Gave up retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 3 times): 503 Service Unavailable
2016-07-25 00:02:35 DEBUG: Crawled (503) <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (referer: None)
2016-07-25 00:02:35 DEBUG: Ignoring response <503 http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx>: HTTP status code is not handled or not allowed
2016-07-25 00:02:35 INFO: Closing spider (finished)
2016-07-25 00:02:35 INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1063,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 8516,
'downloader/response_count': 4,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 24, 16, 2, 35, 655094),
'log_count/DEBUG': 8,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 7, 24, 16, 2, 34, 321404)}
2016-07-25 00:02:35 INFO: Spider closed (finished)
竞技山 posted on 2016-7-25 01:05:
python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 INFO: Scrapy 1.0. ...
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,
302: 1 time
503: 3 times
jiandan has detected that you are a crawler and rejected the requests.
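By default Scrapy announces itself in the User-Agent header (something like "Scrapy/1.0 (+http://scrapy.org)"), which makes it easy for a site to refuse the request. A common first thing to try is sending a browser-like User-Agent from settings.py; the sketch below is only an example with a made-up UA string, and jandan may well keep blocking automated requests anyway:

# settings.py -- pretend to be a regular browser (example value only;
# the site may still detect and block the crawler)
USER_AGENT = ('Mozilla/5.0 (X11; Ubuntu; Linux x86_64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36')

# keep a polite delay between requests
DOWNLOAD_DELAY = 0.25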
SixPy posted on 2016-7-25 06:23:
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,

So that's how you read this output... a whole new world has opened up. I used to think you only had to look at the INFO lines at the bottom. Learned something.

Impressive, I'll study this.

Seeing this code I'm instantly lost... my level isn't even up to yours.

Haven't studied this part yet, I can't understand it at all.

I'm also learning crawlers and likewise waiting for the gurus to reply; I'll study this thread as a reference.

66666

Also still learning...

Taking a look.

Just dropping in for a look.

666

Learning.

Haven't gotten this far yet.

Learning.
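One more thing that may be worth double-checking in the first (tupian) spider: that first log ends with "Crawled 0 pages" and the stats show no downloader requests at all, which is what happens when start_urls ends up empty. In the code as posted the attribute is written star_urls and the parse method is called parse_img, while Scrapy only looks for an attribute named start_urls and, for the initial responses, a callback named parse. A minimal sketch of the expected shape, reusing the names from the project above:

import scrapy

class TupianSpider(scrapy.Spider):
    name = 'tupian'
    allowed_domains = ['toumiaola.com']
    start_urls = ['http://www.toumiaola.com/youtu/']  # must be spelled start_urls

    def parse(self, response):  # default callback for the start_urls responses
        # extract item['image_urls'] and yield the next-page Request here
        pass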