竞技山 posted on 2016-7-24 02:10:57

Help from crawler experts needed (still not working after two days of searching Baidu)

After recently working through 小甲鱼's OOXX lesson, I have been wanting to put it into practice, but that lesson does not use the Scrapy framework, and I want to do the crawl with Scrapy. So, following chapter 63, I tried a picture-crawling exercise.
However, the crawl never succeeds. I hope the senior members here can give me some pointers.
Output shown after running:


python@ubuntu:~/tupian$ scrapy crawl tupian
2016-07-24 02:00:38 INFO: Scrapy 1.0.3 started (bot: tupian)
2016-07-24 02:00:38 INFO: Optional features available: ssl, http11, boto
2016-07-24 02:00:38 INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tupian.spiders', 'SPIDER_MODULES': ['tupian.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'tupian'}
2016-07-24 02:00:38 INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-24 02:00:38 DEBUG: Retrieving credentials from metadata server.
2016-07-24 02:00:39 ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-07-24 02:00:39 ERROR: Unable to read instance data, giving up
2016-07-24 02:00:39 INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-24 02:00:39 INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-24 02:00:39 INFO: Enabled item pipelines: TupianPipeline
2016-07-24 02:00:39 INFO: Spider opened
2016-07-24 02:00:39 INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-24 02:00:39 DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-24 02:00:39 INFO: Closing spider (finished)
2016-07-24 02:00:39 INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 629425),
'log_count/DEBUG': 2,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'start_time': datetime.datetime(2016, 7, 23, 18, 0, 39, 626707)}
2016-07-24 02:00:39 INFO: Spider closed (finished)

The code is as follows:
tupianspider.py file

#coding:utf-8

import scrapy
from tupian.items import TupianItem


class TupianSpider(scrapy.Spider):
    name = 'tupian'  # the spider's unique name
    allowed_domains = ["toumiaola.com"]  # images live under this domain
    start_urls = ["http://www.toumiaola.com/youtu/"]  # starting page for the images

    def parse(self, response):
        # Scrapy calls parse() for each URL in start_urls by default,
        # and the Request below also uses it as the callback.
        item = TupianItem()
        item['image_urls'] = response.xpath('//img/@src').extract()  # extract image URLs
        print 'image_urls', item['image_urls']
        yield item
        # NOTE: the exact selector for the "next page" link is an assumption;
        # adjust the xpath to match the actual page markup.
        next_page = response.xpath(u'//a[contains(text(), "下一页")]/@href').extract_first()  # find the next page
        new_url = 'http://www.toumiaola.com/youtu/' + next_page if next_page else None
        print 'new_url', new_url
        if new_url:
            yield scrapy.Request(new_url, callback=self.parse)
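
As a quick sanity check, the selectors can be tried interactively with scrapy shell before running the full crawl (a sketch; the URL is just the first entry of start_urls, and the next-page xpath, like the one in the spider, is only a guess):

scrapy shell "http://www.toumiaola.com/youtu/"
>>> response.xpath('//img/@src').extract()        # should print a list of image URLs
>>> response.xpath(u'//a[contains(text(), "下一页")]/@href').extract_first()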



items.py file
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TupianItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    image_urls = scrapy.Field()  # image URLs


pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import urllib

from tupian import settings


class TupianPipeline(object):

    def process_item(self, item, spider):
        dir_path = os.path.join(settings.IMAGES_STORE, spider.name)
        print 'dir_path', dir_path
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            file_name = image_url.split('/')[-1]  # image file name
            file_path = os.path.join(dir_path, file_name)
            # print 'file_path', file_path
            if os.path.exists(file_path):
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.urlopen(image_url)  # download the image
                file_writer.write(conn.read())

        return item
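
For reference, Scrapy ships with a built-in ImagesPipeline that does roughly what this hand-written pipeline does: it reads the item's image_urls field, downloads the files, and stores them under IMAGES_STORE. A minimal sketch of switching to it, assuming Pillow is installed (the store path below is only an example):

# settings.py -- use the built-in pipeline instead of TupianPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = '/home/python/images'

Optionally, an images field can be added to TupianItem so the pipeline can record the download results on the item.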

settings.py file


BOT_NAME = 'tupian'

SPIDER_MODULES = ['tupian.spiders']
NEWSPIDER_MODULE = 'tupian.spiders'
ITEM_PIPELINES = {
    'tupian.pipelines.TupianPipeline':1,
}
IMAGES_STORE = 'E:'  # NOTE: 'E:' is a Windows drive letter; on Ubuntu this needs to be a writable path, e.g. '/home/python/images'
DOWNLOAD_DELAY=0.25
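
Incidentally, the "Caught exception reading instance data" traceback at the top of the log comes from boto probing the EC2 metadata server for S3 credentials. On an ordinary machine it is noisy but harmless, and the crawl continues regardless. To silence it, one option (a sketch, not required for the crawl to work) is to disable the S3 download handler in settings.py:

DOWNLOAD_HANDLERS = {
    's3': None,  # stop Scrapy from loading the boto/S3 handler at startup
}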


Sincerely asking for help; I hope someone can point me in the right direction.

竞技山 posted on 2016-7-24 02:17:44

Sigh, every step at the start feels like an abyss; I'm trying hard to climb my way up.

弧矢七 posted on 2016-7-24 10:06:24

Once I've learned this, I'll come back and help you.

竞技山 posted on 2016-7-24 11:40:55

弧矢七 posted on 2016-7-24 10:06
Once I've learned this, I'll come back and help you.

Alright, looking forward to it~

竞技山 posted on 2016-7-24 14:36:21

The URL it points to looks correct too, so why does it just finish right away?

竞技山 posted on 2016-7-25 01:05:50

python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 INFO: Scrapy 1.0.3 started (bot: jiandan)
2016-07-25 00:02:33 INFO: Optional features available: ssl, http11, boto
2016-07-25 00:02:33 INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jiandan.spiders', 'SPIDER_MODULES': ['jiandan.spiders'], 'DOWNLOAD_DELAY': 0.25, 'BOT_NAME': 'jiandan'}
2016-07-25 00:02:33 INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-07-25 00:02:33 DEBUG: Retrieving credentials from metadata server.
2016-07-25 00:02:34 ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 429, in open
    response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 447, in _open
    '_open', req)
File "/usr/lib/python2.7/urllib2.py", line 407, in _call_chain
    result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1228, in http_open
    return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1198, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
2016-07-25 00:02:34 ERROR: Unable to read instance data, giving up
2016-07-25 00:02:34 INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-07-25 00:02:34 INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-07-25 00:02:34 INFO: Enabled item pipelines: JiandanPipeline
2016-07-25 00:02:34 INFO: Spider opened
2016-07-25 00:02:34 INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-07-25 00:02:34 DEBUG: Telnet console listening on 127.0.0.1:6023
2016-07-25 00:02:34 DEBUG: Redirecting (302) to <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> from <GET http://jandan.net/ooxx>
2016-07-25 00:02:34 DEBUG: Retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 1 times): 503 Service Unavailable
2016-07-25 00:02:35 DEBUG: Retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 2 times): 503 Service Unavailable
2016-07-25 00:02:35 DEBUG: Gave up retrying <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (failed 3 times): 503 Service Unavailable
2016-07-25 00:02:35 DEBUG: Crawled (503) <GET http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx> (referer: None)
2016-07-25 00:02:35 DEBUG: Ignoring response <503 http://jandan.net/block.php?from=http%3A%2F%2Fjandan.net%2Fooxx>: HTTP status code is not handled or not allowed
2016-07-25 00:02:35 INFO: Closing spider (finished)
2016-07-25 00:02:35 INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1063,
'downloader/request_count': 4,
'downloader/request_method_count/GET': 4,
'downloader/response_bytes': 8516,
'downloader/response_count': 4,
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 7, 24, 16, 2, 35, 655094),
'log_count/DEBUG': 8,
'log_count/ERROR': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'start_time': datetime.datetime(2016, 7, 24, 16, 2, 34, 321404)}
2016-07-25 00:02:35 INFO: Spider closed (finished)

SixPy posted on 2016-7-25 06:23:32

竞技山 posted on 2016-7-25 01:05
python@ubuntu:~/tupian/jiandan$ scrapy crawl jiandan
2016-07-25 00:02:33 INFO: Scrapy 1.0. ...

'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,

302: 1 time
503: 3 times

jiandan detected that you are a crawler and refused the requests.
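
One thing that sometimes helps (no guarantee, since the site may also check cookies or request rate) is to send a browser-like User-Agent instead of Scrapy's default one, for example in settings.py:

# sketch: the UA string below is just an example of a desktop browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'
DOWNLOAD_DELAY = 2  # slow down so the requests look less bot-like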

竞技山 posted on 2016-7-25 15:53:18

SixPy posted on 2016-7-25 06:23
'downloader/response_status_count/302': 1,
'downloader/response_status_count/503': 3,



So that's how you read the output... a whole new world opened up. I thought you only had to look at the INFO lines at the bottom... learned something.

L丶 posted on 2016-12-5 17:20:51

Impressive. Learning from this.

wxb19840810 posted on 2016-12-13 01:33:12

Seeing this code, my mind instantly went blank... I'm not even at your level...

13993793879 posted on 2016-12-14 16:16:19

I haven't gotten this far yet; I can't understand it at all.

qzssmdx posted on 2016-12-14 17:42:46

I'm also learning crawlers... likewise waiting for the experts to reply, and studying this thread for reference.

ljmpython posted on 2016-12-18 13:57:26

66666

qzssmdx posted on 2016-12-18 17:29:24

Still learning too...

rayleigh_tong posted on 2016-12-30 08:34:49

Taking a look.

qzssmdx posted on 2016-12-30 08:45:54

Just came in to take a look...

whdd posted on 2018-9-19 15:21:26

666, learning from this.

GOD乌索普 posted on 2018-9-19 16:52:19

Haven't gotten to this part yet.

钱闻韬 posted on 2018-9-19 20:10:45

Learning.