[Solved] Asking the crawler experts for help (two days of searching Baidu and still not working)

竞技山 · Posted 2016-7-24 02:10:57

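After recently studying 小甲鱼's OOXX lesson I have been wanting to practice, but that lesson does not use the Scrapy framework, and I want to crawl with Scrapy. So I followed chapter 63 to write an image-crawling exercise.
However, the crawl never succeeds. I hope the more experienced members here can give me some guidance.
Output shown after running: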

python@ubuntu:~/tupian$ scrapy crawl tupian
2016-07-24 02:00:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: tupian)
2016-07-24 02:00:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-07-24 02:00:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tupian.spiders', 'SPIDER_MODULES': ['tupian.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'tupian'}
2016-07-24 02:00:38 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-24 02:00:38 [boto] DEBUG: Retrieving credentials from metadata server.
2016-07-24 02:00:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-07-24 02:00:39 [boto] ERROR: Unable to read instance data, giving up
2016-07-24 02:00:39 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-24 02:00:39 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-24 02:00:39 [scrapy] INFO: Enabled item pipelines: TupianPipeline
2016-07-24 02:00:39 [scrapy] INFO: Spider opened
2016-07-24 02:00:39 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-24 02:00:39 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-24 02:00:39 [scrapy] INFO: Closing spider (finished)
2016-07-24 02:00:39 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 629425),
'log_count/DEBUG': 2,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 626707)}
2016-07-24 02:00:39 [scrapy] INFO: Spider closed (finished)

The code is as follows:
tupianspider.py file

# coding: utf-8

import scrapy
from tupian.items import TupianItem


class tupianSpidier(scrapy.Spider):
    name = 'tupian'  # unique name of this spider
    allowed_domains = ["toumiaola.com"]  # the images live under this domain
    start_urls = ["http://www.toumiaola.com/youtu/"]  # start page (was misspelled "star_urls")

    def parse(self, response):
        # Responses for start_urls go to parse() by default (was "parse_img")
        item = TupianItem()
        # Extract image addresses; urljoin makes relative src values absolute
        item['image_urls'] = [response.urljoin(src)
                              for src in response.xpath('//img/@src').extract()]
        self.logger.info('image_urls %s', item['image_urls'])
        yield item
        # Follow pagination links. Assumption: their href contains "list_";
        # adjust the XPath to the site's real markup.
        for href in response.xpath('//a[contains(@href, "list_")]/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)
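
A side note on building the next-page URL: rather than concatenating strings, the standard library's `urljoin` can resolve an href against the page it was found on. The sketch below is only an illustration; `list_2.html` and `list_3.html` are made-up hrefs, since the post does not show the site's real link names.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2, as used in the post

BASE_URL = "http://www.toumiaola.com/youtu/"  # start page from the post


def next_page_url(base_url, href):
    """Resolve a (possibly relative) href against the page it was found on."""
    return urljoin(base_url, href)


# Both hrefs below are hypothetical and used only for illustration.
relative = next_page_url(BASE_URL, "list_2.html")
absolute = next_page_url(BASE_URL, "http://www.toumiaola.com/youtu/list_3.html")
```

A relative href is appended to the base directory, while an already absolute href is returned unchanged, so the same call handles both cases.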

items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class TupianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    image_urls = scrapy.Field()  # image URLs

pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import urllib  # Python 2 API; on Python 3 use urllib.request instead

from tupian import settings


class TupianPipeline(object):

    def process_item(self, item, spider):
        # Save images under <IMAGES_STORE>/<spider name>
        dir_path = os.path.join(settings.IMAGES_STORE, spider.name)
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            file_name = image_url.split('/')[-1]  # image file name
            file_path = os.path.join(dir_path, file_name)
            if os.path.exists(file_path):  # already downloaded, skip it
                continue
            conn = urllib.urlopen(image_url)  # download the image
            # The with-block closes the file, so no explicit close() is needed
            with open(file_path, 'wb') as file_writer:
                file_writer.write(conn.read())

        return item

settings.py file

BOT_NAME = 'tupian'

SPIDER_MODULES = ['tupian.spiders']
NEWSPIDER_MODULE = 'tupian.spiders'
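ITEM_PIPELINES = {
    'tupian.pipelines.TupianPipeline': 1,
}
# Note: 'E:' is a Windows drive letter; on the Ubuntu machine shown in the log
# it is treated as a relative directory literally named "E:".
IMAGES_STORE = 'E:'
DOWNLOAD_DELAY = 0.25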

I am sincerely asking for help and hope someone can answer.
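
A likely reason the first run closes right after "Crawled 0 pages": the stats dump contains no `downloader/request_count` entry, so no request was ever sent. The spider defines `star_urls` instead of `start_urls`, and its callback is named `parse_img` instead of `parse`. The stand-in classes below are not Scrapy code; they only imitate, in simplified form, how a missing `start_urls` attribute leaves nothing to request. `urls_to_request` is a made-up helper name.

```python
class MisspelledSpider(object):
    # Attribute name as written in the post: "star_urls", not "start_urls"
    star_urls = ["http://www.toumiaola.com/youtu/"]


class FixedSpider(object):
    start_urls = ["http://www.toumiaola.com/youtu/"]


def urls_to_request(spider_cls):
    """Simplified stand-in for a default start-request lookup:
    only an attribute named exactly 'start_urls' is consulted."""
    return list(getattr(spider_cls, "start_urls", []))


misspelled = urls_to_request(MisspelledSpider)  # nothing to crawl
fixed = urls_to_request(FixedSpider)            # one start URL
```

With an empty list there is nothing to schedule, which matches a spider that opens and finishes within the same second.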

竞技山 · Posted 2016-7-24 02:17:44

Sigh, in the early stages every step is an abyss; I'm struggling to climb up.

弧矢七 · Posted 2016-7-24 10:06:24

Once I've learned this, I'll come back and help you answer it.

竞技山 · Posted 2016-7-24 11:40:55

弧矢七 posted on 2016-7-24 10:06
Once I've learned this, I'll come back and help you answer it.

OK. Looking forward to it~

竞技山 · Posted 2016-7-24 14:36:21

The target URL looks correct too. Why does the spider just finish right away?

竞技山 · Posted 2016-7-25 01:05:50

python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 [scrapy] INFO: Scrapy 1.0.3 started (bot: jiandan)
2016-07-25 00:02:33 [scrapy] INFO: Optional features available: ssl, http11, boto
2016-07-25 00:02:33 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jiandan.spiders', 'SPIDER_MODULES': ['jiandan.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'jiandan'}
2016-07-25 00:02:33 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-25 00:02:33 [boto] DEBUG: Retrieving credentials from metadata server.
2016-07-25 00:02:34 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
  File "/usr/lib/python2.7/urllib2.py", line 429, in open
response = self._open(req, data)
  File "/usr/lib/python2.7/urllib2.py", line 447, in _open
'_open', req)
  File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
raise URLError(err)
URLError: <urlopen error timed out>
2016-07-25 00:02:34 [boto] ERROR: Unable to read instance data, giving up
2016-07-25 00:02:34 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-25 00:02:34 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-25 00:02:34 [scrapy] INFO: Enabled item pipelines: JiandanPipeline
2016-07-25 00:02:34 [scrapy] INFO: Spider opened
2016-07-25 00:02:34 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-25 00:02:34 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-25 00:02:34 [scrapy] DEBUG: Redirecting (302) to <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> from <GET http://jandan.net/ooxx>
2016-07-25 00:02:34 [scrapy] DEBUG: Retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 1 times): 503 Service Unavailable
2016-07-25 00:02:35 [scrapy] DEBUG: Retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 2 times): 503 Service Unavailable
2016-07-25 00:02:35 [scrapy] DEBUG: Gave up retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 3 times): 503 Service Unavailable
2016-07-25 00:02:35 [scrapy] DEBUG: Crawled (503) <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (referer: None)
2016-07-25 00:02:35 [scrapy] DEBUG: Ignoring response <503 http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx>: HTTP status code is not handled or not allowed
2016-07-25 00:02:35 [scrapy] INFO: Closing spider (finished)
2016-07-25 00:02:35 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1063,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 8516,
'downloader/response_count': 4,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 24, 16, 2, 35, 655094),
'log_count/DEBUG': 8,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 7, 24, 16, 2, 34, 321404)}
2016-07-25 00:02:35 [scrapy] INFO: Spider closed (finished)

SixPy · Posted 2016-7-25 06:23:32 (best answer)

竞技山 posted on 2016-7-25 01:05
python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 [scrapy] INFO: Scrapy 1.0. ...

'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,

302: 1 time
503: 3 times

jandan.net detected that you are a crawler and refused the request.
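
One common thing to try when a site answers with a 302 to a block page followed by 503 is to send a browser-like User-Agent and slow the crawl down. `USER_AGENT` and `DOWNLOAD_DELAY` are standard Scrapy settings; the User-Agent string and the delay value below are placeholder assumptions, and there is no guarantee jandan.net will accept them, since it may use other checks as well.

```python
# settings.py (fragment)

# Placeholder browser-like User-Agent; any real browser string can be substituted.
USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"

# Wait longer between requests than the 0.25 s used in the post (the value is a guess).
DOWNLOAD_DELAY = 2
```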

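竞技山 · Posted 2016-7-25 15:53:18

SixPy posted on 2016-7-25 06:23
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,

So that's how you read the output... a whole new world just opened up. I used to think you only had to look at the INFO lines at the very bottom... Lesson learned.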
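
For reading the stats dump programmatically rather than by eye, a small helper can pull out the per-status response counts. This is only a sketch that takes a plain dict shaped like the dump above; `status_counts` is a made-up helper name, not part of Scrapy.

```python
PREFIX = "downloader/response_status_count/"


def status_counts(stats):
    """Return {http_status: count} from a Scrapy-style stats dict."""
    return {
        int(key[len(PREFIX):]): value
        for key, value in stats.items()
        if key.startswith(PREFIX)
    }


# Values copied from the stats dump in the post.
stats = {
    "downloader/request_count": 4,
    "downloader/response_status_count/302": 1,
    "downloader/response_status_count/503": 3,
    "finish_reason": "finished",
}
counts = status_counts(stats)
# Any 4xx/5xx status suggests the server refused or failed the request.
blocked = any(status >= 400 for status in counts)
```

A `finish_reason` of "finished" only says the spider ran out of work, so the status counts are the place to look for refusals.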

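L丶 · Posted 2016-12-5 17:20:51

Impressive. I'm here to learn.

wxb19840810 · Posted 2016-12-13 01:33:12

Seeing all this code, I was instantly baffled... my level isn't even as good as yours...

流月飞星 · Posted 2016-12-14 08:43:30

13993793879 · Posted 2016-12-14 16:16:19

I haven't studied this far yet; I can't understand it at all.

qzssmdx · Posted 2016-12-14 17:42:46

I'm also learning web crawling... likewise waiting for an expert to reply, and following along to learn.

ljmpython · Posted 2016-12-18 13:57:26

66666

qzssmdx · Posted 2016-12-18 17:29:24

Also still learning...

rayleigh_tong · Posted 2016-12-30 08:34:49

Just taking a look.

qzssmdx · Posted 2016-12-30 08:45:54

Dropped in to have a look...

whdd · Posted 2018-9-19 15:21:26

666, learning.

GOD乌索普 · Posted 2018-9-19 16:52:19

Haven't gotten to this part yet.

钱闻韬 · Posted 2018-9-19 20:10:45

Learning.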
