鱼C论坛

Views: 2196 | Replies: 2

[Solved] Crawling Huaban with the Scrapy framework

Posted on 2016-9-30 22:34:07

After learning Scrapy, I followed an online Scrapy tutorial that crawls Jandan (煎蛋) and wrote a crawler for Huaban, but it never extracts anything. I've been hunting for the cause for a long time without luck. Could some expert take a look for me?
huaban_spider.py
import scrapy
from huaban.items import HuabanItem


class HuabanSpider(scrapy.spiders.Spider):
    name = "huaban"
    # the attribute Scrapy actually reads is "allowed_domains", not "allow_domains"
    allowed_domains = ["huaban.com"]
    start_urls = ["http://huaban.com/pins/872365966"]

    def parse(self, response):
        item = HuabanItem()
        item['image_urls'] = response.xpath('//div[@class="main-image"]//div/a/img/@src').extract()
        print('image_urls', item['image_urls'])
        yield item
        # follow the next pin in the waterfall, if any
        new_url = response.xpath('//div[@id="board_pins_waterfall"]/a/@href').extract_first()
        if new_url:
            # hrefs here are usually relative, so join them against the current URL
            yield scrapy.Request(response.urljoin(new_url), callback=self.parse)
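A quick way to see why image_urls comes back empty is scrapy shell, which fetches the page exactly the way the spider does and lets you try the XPath by hand. A sketch of the session, using the URL from start_urls (the sample output is what you would expect if the page is rendered by JavaScript):

C:\Users\Administrator\huaban>scrapy shell "http://huaban.com/pins/872365966"
>>> response.xpath('//div[@class="main-image"]//div/a/img/@src').extract()
[]                                   # empty: the <img> nodes are not in the raw HTML
>>> 'main-image' in response.text    # is the target div in the server's HTML at all?
False

If the expression is empty here too, the image markup is built client-side by JavaScript, and no XPath against the downloaded HTML can ever match it; the data has to be dug out of the page source (for example, embedded JSON) instead.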

pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
import urllib.request   # "import urllib" alone does not pull in urllib.request

from huaban import settings


class HuabanPipeline(object):
    def process_item(self, item, spider):
        dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)
        print("dir_path", dir_path)
        if not os.path.exists(dir_path):
            os.makedirs(dir_path)
        for image_url in item['image_urls']:
            file_name = image_url.split('/')[-1]
            file_path = '%s/%s' % (dir_path, file_name)
            # check the full path, not the bare file name, or the test never matches
            if os.path.exists(file_path):
                continue
            with open(file_path, 'wb') as file_writer:
                conn = urllib.request.urlopen(image_url)
                # original had a typo here: "file_write" instead of "file_writer"
                file_writer.write(conn.read())
            # the "with" block already closes the file; no explicit close() needed
        return item
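Incidentally, because the item field is already named image_urls, Scrapy's stock ImagesPipeline could replace this hand-written pipeline entirely: it downloads every URL in image_urls and files the results under IMAGES_STORE, with retries and duplicate detection for free. A minimal sketch (it needs Pillow installed; the store path is illustrative):

# settings.py -- use the built-in images pipeline instead of HuabanPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'D:/huaban'   # files land in IMAGES_STORE/full/<sha1-hash>.jpg

Optionally add images = scrapy.Field() to HuabanItem, and the pipeline will record the download results (url, path, checksum) in that field.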

items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class HuabanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()

settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for huaban project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'huaban'

SPIDER_MODULES = ['huaban.spiders']
NEWSPIDER_MODULE = 'huaban.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'huaban (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'huaban.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'huaban.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'huaban.pipelines.HuabanPipeline': 1,
}
IMAGES_STORE = 'D:'
DOWNLOAD_DELAY = 0.25

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
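Two of these settings deserve attention. With ROBOTSTXT_OBEY = True, Scrapy fetches robots.txt first (it shows up in the log below) and silently drops any request that file disallows; and with USER_AGENT commented out, Scrapy announces itself with its default bot user agent, which some sites answer with stripped-down HTML. In the log below both requests returned 200, so robots.txt is not the blocker here, but when a crawl mysteriously yields nothing, these two lines are a common first experiment (for debugging only; respect the site's rules in a real crawl):

# settings.py -- debugging experiment, not a recommendation
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'   # illustrative browser-style UA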



The crawl output looks like this:
C:\Users\Administrator\huaban>scrapy crawl huaban
2016-09-30 22:24:52 [scrapy] INFO: Scrapy 1.1.3 started (bot: huaban)
2016-09-30 22:24:52 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 0.25, 'SPIDER_MODULES': ['huaban.spiders'], 'BOT_NAME': 'huaban', 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'huaban.spiders'}
2016-09-30 22:24:52 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole']
2016-09-30 22:24:53 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-09-30 22:24:53 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-09-30 22:24:53 [scrapy] INFO: Enabled item pipelines:
['huaban.pipelines.HuabanPipeline']
2016-09-30 22:24:53 [scrapy] INFO: Spider opened
2016-09-30 22:24:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-30 22:24:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-30 22:24:53 [scrapy] DEBUG: Crawled (200) <GET http://huaban.com/robots.txt> (referer: None)
2016-09-30 22:24:54 [scrapy] DEBUG: Crawled (200) <GET http://huaban.com/pins/872365966> (referer: None)
image_urls []
dir_path D:/huaban
2016-09-30 22:24:54 [scrapy] DEBUG: Scraped from <200 http://huaban.com/pins/872365966>
{'image_urls': []}
2016-09-30 22:24:54 [scrapy] INFO: Closing spider (finished)
2016-09-30 22:24:54 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 440,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16658,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 9, 30, 14, 24, 54, 540733),
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 9, 30, 14, 24, 53, 674683)}
2016-09-30 22:24:54 [scrapy] INFO: Spider closed (finished)
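Note what the log actually says: both downloads returned 200 and one item was scraped, but that item is {'image_urls': []}. The requests succeeded; the XPath simply matched nothing. One way to confirm what the server really sends is to fetch the page outside Scrapy and inspect it. Huaban, like many image sites, appears to embed pin data as JSON in a <script> tag rather than as ready-made <img> nodes; the JSON field name in the sketch below is an assumption to verify by eye against the page source:

# sketch: fetch the pin page directly and look for embedded JSON
import re
import urllib.request

req = urllib.request.Request(
    'http://huaban.com/pins/872365966',
    headers={'User-Agent': 'Mozilla/5.0'},   # browser-style UA, illustrative
)
html = urllib.request.urlopen(req).read().decode('utf-8')

print('main-image' in html)   # False would mean the div is JS-rendered
# hypothetical: image identifiers often sit in JSON like "key":"..."
print(re.findall(r'"key"\s*:\s*"([^"]+)"', html)[:5])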

OP | Posted on 2016-10-2 22:22:19
Please help!! Could some expert take a look?!!

Posted on 2016-10-3 11:58:58 | Best answer
'downloader/response_status_count/200': 2,

It already succeeded twice.
