[Solved] The first page gets scraped, so why isn't the second page crawled?

923204485 · Posted on 2018-10-28 21:08:17

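Why isn't it crawling the second page? Is it a problem with the indentation of my yield? Could someone experienced take a look?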

# -*- coding: utf-8 -*-
import scrapy
from Tengxun.items import TengxunItem


class TengxunSpider(scrapy.Spider):
    name = 'tengxun'
    allowed_domains = ['https://hr.tencent.com']
    baseURL = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        node_list = response.xpath('//tr[@class="even"] | //tr[@class="odd"]')
        for node in node_list:
            item = TengxunItem()
            item['positionName'] = node.xpath('.//a/text()').extract()[0]  # job title
            item['positionLink'] = node.xpath('.//a/@href').extract()[0]  # detail link
            if len(node.xpath('./td[2]/text()')):
                item['positionType'] = node.xpath('./td[2]/text()').extract()[0]  # category
            else:
                item['positionType'] = ' '
            item['peopleNumber'] = node.xpath('./td[3]/text()').extract()[0]  # headcount
            item['workLocation'] = node.xpath('./td[4]/text()').extract()[0]  # location
            item['publishTime'] = node.xpath('./td[5]/text()').extract()[0]  # .encode('utf-8')
            yield item  # the for loop yields each item; once all six fields are read, the if below runs
        if self.offset < 2950:
            self.offset += 10
            # offset is an int, so convert it with str() before concatenating
            url = self.baseURL + str(self.offset)
            # callback=self.parse specifies the callback function
            yield scrapy.Request(url, callback=self.parse)  # request the second page and feed it back into the for loop

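This is what cmd prints when execution reaches the pagination code. There is no error, but I can't work out why the second page isn't crawled.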
2018-10-28 21:01:45 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'hr.tencent.com': <GET https://hr.tencent.com/position.php&start=10>
2018-10-28 21:01:45 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-28 21:01:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

923204485 · Posted on 2018-10-28 21:14:30

I ran the cmd output through a translator:

2018-10-28 21:12:17 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered out the offsite request to 'hr.tencent.com': <GET https://hr.tencent.com/position.php?&start=10>
2018-10-28 21:12:17 [scrapy.core.engine] INFO: Closing spider (finished) INFO: Dumping stats:

wongyusing · Posted on 2018-10-28 22:10:09

This best answer was given by wongyusing. Thanks to wongyusing for the answer.

Last edited by wongyusing on 2018-10-28 22:14

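You took this code from an online Scrapy tutorial, right?
If I remember correctly, what follows there is a for loop, not an if.
Another possibility is that the offset doesn't increase by 10 each time.
Go and check how much it increases per page on the site.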
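
One way to check the step size is to list the URLs the spider would request and compare them with the site's own pagination links. This is a minimal sketch, assuming the step of 10 and the upper bound of 2950 from the original post; `page_urls` is a hypothetical helper, not part of Scrapy.

```python
BASE_URL = "https://hr.tencent.com/position.php?&start="


def page_urls(base_url, last_offset, step):
    """Build one listing URL per page, from offset 0 up to and including last_offset."""
    return [base_url + str(offset) for offset in range(0, last_offset + 1, step)]


# Values taken from the original spider; the real step must be confirmed on the site.
urls = page_urls(BASE_URL, last_offset=2950, step=10)
print(urls[0])    # first page
print(urls[1])    # second page
print(len(urls))  # number of pages that would be requested
```

A list built this way could also be assigned to `start_urls`, which is closer to the for-loop style the reply describes than chaining one request per page.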

Yet another possibility: one of the job postings is missing a field, which raises an error and interrupts the spider.
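
The original spider already guards `positionType` against an empty result, but every other `extract()[0]` would raise an `IndexError` if a cell were missing. Below is a minimal sketch of the same guard as a reusable function; `first_or_default` is a hypothetical helper that works on plain lists, standing in for the list that `extract()` returns. Scrapy selectors also provide `extract_first(default=...)`, which serves the same purpose.

```python
def first_or_default(values, default=""):
    """Return the first extracted value, or `default` when nothing matched."""
    return values[0] if values else default


# Simulated results of node.xpath(...).extract() for two table rows.
row_with_type = ["Technical"]
row_without_type = []

print(first_or_default(row_with_type))          # value found in the cell
print(first_or_default(row_without_type, " "))  # falls back to the default
```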

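923204485 · Posted on 2018-10-28 22:36:17

wongyusing posted on 2018-10-28 22:10
You took this code from an online Scrapy tutorial, right?
If I remember correctly, what follows there is a for loop, not an if.
Another possibility is that each ...

It's fixed now. The problem was my main domain setting, which was restricting the spider from following on to other pages.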
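
For readers who see the same `Filtered offsite request` line: it comes from Scrapy's offsite middleware, which compares each request's hostname with the entries in `allowed_domains`. An entry that includes a scheme, such as `'https://hr.tencent.com'`, can never match a hostname, so the follow-up request is dropped, while the requests generated from `start_urls` are not subject to that filter. The sketch below is a simplified stand-in for the check using only the standard library; `is_allowed` is a hypothetical helper, not Scrapy's actual implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname equals, or is a subdomain of, an allowed entry."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


next_page = "https://hr.tencent.com/position.php?&start=10"

# Entry written as a URL, as in the original spider: no hostname can match it.
print(is_allowed(next_page, ["https://hr.tencent.com"]))  # False
# Entry written as a bare domain: the request would be followed.
print(is_allowed(next_page, ["hr.tencent.com"]))  # True
print(is_allowed(next_page, ["tencent.com"]))  # True
```

In the spider, this corresponds to writing the setting as `allowed_domains = ['hr.tencent.com']`.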

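923204485 · Posted on 2018-10-28 22:36:48

wongyusing posted on 2018-10-28 22:10
You took this code from an online Scrapy tutorial, right?
If I remember correctly, what follows there is a for loop, not an if.
Another possibility is that each ...

Thank you very much for the reply all the same.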
