[Solved] Why does my Scrapy spider stop after crawling only one page?

况qiqi · posted 2018-4-8 22:14:59

Here is my code. Could someone take a look and tell me what is wrong? Thanks!

import scrapy
from scrquote_all.items import ScrquoteAllItem

class QuoteallSpider(scrapy.Spider):
    name = 'quoteall'
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]
    items = []

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = ScrquoteAllItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            self.items.append(item)
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            response.follow(next_page, callback=self.parse)
        return self.items

第四时空 · posted 2018-4-9 00:42:28

I think the last three lines should look like this... I think...

if next_page is not None:
    yield scrapy.Request(next_page, callback=self.parse)
yield self.items
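
For context on why the original spider stopped after page 1: response.follow(...) only builds a request object, and a callback has to yield (or return) that object before anything can schedule it. The sketch below imitates this with plain Python and does not import Scrapy. FakeRequest, parse_discarding, and parse_yielding are hypothetical stand-ins made up for illustration, not Scrapy APIs.

```python
from dataclasses import dataclass


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request object; this is not scrapy.Request."""
    url: str


def parse_discarding(next_page):
    """Builds a request but never hands it back, like the original spider."""
    results = []
    if next_page is not None:
        FakeRequest(next_page)  # created, then immediately thrown away
    return results


def parse_yielding(next_page):
    """Yields the request so the caller can see it and act on it."""
    if next_page is not None:
        yield FakeRequest(next_page)


# The caller only ever sees what the callback hands back.
discarded = list(parse_discarding("/page/2/"))
scheduled = list(parse_yielding("/page/2/"))
print(discarded)  # []
print(scheduled)  # [FakeRequest(url='/page/2/')]
```

In the first function the request for the next page exists for a moment and is then lost, which matches a crawl that ends after one page.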

况qiqi · posted 2018-4-9 09:04:02

第四时空 posted 2018-4-9 00:42
I think the last three lines should look like this... I think...

That's not right. With this change the spider does walk through every page, but it raises an error!
2018-04-09 09:00:22 [scrapy.core.scraper] ERROR: Spider must return Request, BaseItem, dict or None, got 'list' in <GET http://quotes.toscrape.com/page/8/>

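第四时空 · posted 2018-4-9 10:40:32 (best answer)

况qiqi posted 2018-4-9 09:04
That's not right. With this change the spider does walk through every page, but it raises an error!
2018-04-09 09:00:22 [scrapy.core.scraper] ERRO ...

That is probably because the last statement hands back self.items, which is a list.
Scrapy cannot process that object.
Don't return self.items; instead, yield item inside the for loop.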
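
The difference between yielding one list and yielding each item can be sketched in plain Python. The type check below is a simplified, hypothetical imitation of the rule quoted in the error message (only dict and None are accepted here); check_output, parse_yield_list, and parse_yield_items are made-up names, not Scrapy functions.

```python
def check_output(callback_output):
    """Hypothetical check: accept dicts and None, reject every other type."""
    accepted, rejected = [], []
    for obj in callback_output:
        if obj is None or isinstance(obj, dict):
            accepted.append(obj)
        else:
            rejected.append(type(obj).__name__)
    return accepted, rejected


def parse_yield_list(quotes):
    """Collects items first, then yields the whole list as one object."""
    items = []
    for text in quotes:
        items.append({"text": text})
    yield items


def parse_yield_items(quotes):
    """Yields each item on its own from inside the loop."""
    for text in quotes:
        yield {"text": text}


quotes = ["first quote", "second quote"]
print(check_output(parse_yield_list(quotes)))
# ([], ['list'])
print(check_output(parse_yield_items(quotes)))
# ([{'text': 'first quote'}, {'text': 'second quote'}], [])
```

The consumer sees the list as a single object of an unsupported type, while the per-item version delivers one dict at a time.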

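况qiqi · posted 2018-4-9 10:54:40

第四时空 posted 2018-4-9 10:40
That is probably because the last statement hands back self.items, which is a list.
Scrapy cannot process that object.
Don't return self.items; ...

OK, I'll give it a try.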

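况qiqi · posted 2018-4-9 11:18:59

第四时空 posted 2018-4-9 10:40
That is probably because the last statement hands back self.items, which is a list.
Scrapy cannot process that object.
Don't return self.items; ...

I tried what you suggested and the crawl succeeded.
The code now looks like this: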
import scrapy
from scrquote_all.items import ScrquoteAllItem

class QuoteallSpider(scrapy.Spider):
    name = 'quoteall'
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = ScrquoteAllItem()
            item['text'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.css('small.author::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

But I still have one question: the final result is still one big list, which is exactly what I was trying to produce in the first place.
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
...
]
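This is the format of the JSON file that gets saved at the end. I am clearly returning each result separately, so why is the saved file still one big list?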

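第四时空 · posted 2018-4-9 11:42:05

况qiqi posted 2018-4-9 11:18
I tried what you suggested and the crawl succeeded.
The code now looks like this:
import scrapy

I don't really know how to explain this one. That is just how it works: each item is one record, and multiple items together make up a list of items.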
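
One way to look at it: a .json file has to hold a single valid JSON document, so Scrapy's JSON exporter receives the items one at a time but wraps them all in one array when it writes them out. Exporting to JSON Lines instead (a .jl output file) writes one object per line with no enclosing list. The sketch below imitates both layouts with the standard json module only; parse, export_as_json_array, and export_as_json_lines are hypothetical names, and the sample items are cut down from the output shown above.

```python
import json


def parse():
    """Stand-in for a spider callback that yields items one by one."""
    yield {"author": "Albert Einstein",
           "tags": ["change", "deep-thoughts", "thinking", "world"]}
    yield {"author": "J.K. Rowling", "tags": ["abilities", "choices"]}


def export_as_json_array(items):
    """Serialize items one at a time, wrapped in a single JSON array."""
    parts = [json.dumps(item) for item in items]
    return "[\n" + ",\n".join(parts) + "\n]"


def export_as_json_lines(items):
    """Serialize one standalone JSON object per line (JSON Lines style)."""
    return "\n".join(json.dumps(item) for item in items)


array_text = export_as_json_array(parse())
lines_text = export_as_json_lines(parse())
print(array_text)
print(lines_text)
```

In both cases the callback yields items individually; the big list comes only from how the file is written, not from how the spider returns its results.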

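况qiqi · posted 2018-4-9 14:44:38

第四时空 posted 2018-4-9 11:42
I don't really know how to explain this one. That is just how it works: each item is one record, and multiple items together make up a list of items. ...

OK, I'll study this more carefully. Thank you!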
