Some questions about CrawlSpider in the Scrapy framework
I want to use the Scrapy framework to crawl the paginated "latest updates" list of urban novels under https://www.xbiquge.la/fenlei/3_1.html, including the book name on the left, the title of the most recently updated chapter on the right, and the content of that chapter when you click through. It has to be crawled with a spider generated via -t crawl.
-----------------------------zuixin.py--------------------------------------
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from ..items import BiqugeItem
class ZuixinSpider(CrawlSpider):
    name = 'zuixin'
    allowed_domains = ['xbiquge.la']
    start_urls = ['https://www.xbiquge.la/dushixiaoshuo/']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//div[@id="newscontent"]/div/ul/li/span/a'), callback='second_parse', follow=False),
    )

    def parse_item(self, response):
        self.item = BiqugeItem()
        self.item['bookname'] = response.xpath('//div[@id="newscontent"]/div/ul/li/span/a/text()').getall()
        self.item['title'] = response.xpath('//div[@id="newscontent"]/div/ul/li/span/a/text()').getall()
        # print(item['bookname'])
        # print(item['title'])

    def second_parse(self, response):
        self.item['contents'] = response.xpath('//div[@id="content"]/text()').getall()
        # print(item['contents'])
        yield self.item
-----------------------------------pipelines.py----------------------------------------
from itemadapter import ItemAdapter
import os

class BiqugePipeline:
    def open_spider(self, spider):
        if not os.path.exists('最新更新'):
            os.mkdir('最新更新')

    def process_item(self, item, spider):
        print(item)
        # Unpack each zipped row so the filename is bookname_title rather
        # than the whole tuple repeated twice.
        for bookname, title, contents in zip(item['bookname'], item['title'], item['contents']):
            with open(fr'./最新更新/{bookname}_{title}.txt', 'w', encoding='utf-8') as f:
                f.write(contents)
        return item
----------------------------------------------------------------------------------------
The chapter content I get after clicking through is wrong. I don't know how, in a CrawlSpider, to pass the item built in parse_item on to second_parse, so as a stopgap I used self.item, which is certainly incorrect. What is the right way to handle this? Any pointers would be much appreciated.
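For context, the usual Scrapy way to hand data from one callback to the next is to attach it to the request itself, via Request.meta or (since Scrapy 1.7) the cleaner cb_kwargs, rather than storing it on self, where concurrent responses overwrite each other. Below is a minimal sketch of that pattern: it drops the second Rule and yields the chapter requests explicitly from parse_item. The pager XPath and the per-row anchor positions are assumptions and may need adjusting to the real page markup.
-----------------------------sketch (cb_kwargs)------------------------------
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZuixinSketchSpider(CrawlSpider):
    name = 'zuixin_sketch'
    allowed_domains = ['xbiquge.la']
    start_urls = ['https://www.xbiquge.la/dushixiaoshuo/']

    # One rule for the pagination links only; chapter pages are requested
    # explicitly below so each request carries its own item.
    # '//div[@class="pagelink"]/a' is a guessed pager selector.
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="pagelink"]/a'),
             callback='parse_item', follow=True),
    )

    def parse_start_url(self, response):
        # Make sure the first list page is parsed too, not only the
        # pages reached through the rule.
        return self.parse_item(response)

    def parse_item(self, response):
        # One item per list row; a plain dict keeps the sketch
        # self-contained (swap in BiqugeItem in the real project).
        for row in response.xpath('//div[@id="newscontent"]/div/ul/li'):
            links = row.xpath('./span/a')
            if len(links) < 2:
                continue
            item = {
                # Assumption: first anchor is the book, last the newest chapter.
                'bookname': links[0].xpath('./text()').get(),
                'title': links[-1].xpath('./text()').get(),
            }
            chapter_url = links[-1].xpath('./@href').get()
            # cb_kwargs ties the item to this specific request, so
            # second_parse receives exactly the row it belongs to.
            yield response.follow(chapter_url,
                                  callback=self.second_parse,
                                  cb_kwargs={'item': item})

    def second_parse(self, response, item):
        # Entries passed in cb_kwargs arrive here as named arguments.
        item['contents'] = ''.join(
            response.xpath('//div[@id="content"]/text()').getall())
        yield item
----------------------------------------------------------------------------
With the item completed only in second_parse, the pipeline receives one finished item per chapter, instead of parallel lists that have to be zipped back together.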