Scrapy question, Python Discussion, Programming Languages, FishC Forum (鱼C论坛)

黑盒子 posted on 2020-9-5 21:02:43

Scrapy question

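Why can't I scrape anything? My XPath expressions locate the elements, so why does the spider come back empty? What is going on?
Tencent careers site: https://careers.tencent.com/search.html?index=1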
# tencent.py
# Spider file for the crawl project
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    # allowed_domains = ['tencent.com']
    offsets = 1
    baseUrl = 'https://careers.tencent.com/search.html'
    index = '?index='
    start_urls = [baseUrl + index + str(offsets)]

    def parse(self, response):
        node_list = response.xpath('//div[@class="recruit-list"]')

        for node in node_list:
            item = TencentItem()
            # .get() returns the first match as a str (or None). The original
            # .extract().encode('utf-8') would fail, since .extract() returns a list.
            item['positionName'] = node.xpath('./a/h4/text()').get()
            item['workLocation'] = node.xpath('./a/span/text()').get()
            item['positionType'] = node.xpath('./a/span/text()').get()
            item['publishTimes'] = node.xpath('./a/span/text()').get()
            yield item

        # Queue the next page until offsets reaches 2.
        if self.offsets < 2:
            self.offsets += 1
            url = self.baseUrl + self.index + str(self.offsets)
            yield scrapy.Request(url, callback=self.parse)

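This is only part of the project. I'm using Scrapy, but I get no data at all. Why? The XPath does match on the page in the browser. Any help is appreciated.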

YunGuo posted on 2020-9-7 18:36:56

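You located the elements with a browser extension, right? If you look at the response for this URL, it contains no data at all. Sites like this usually load their data through JavaScript, so XPath on the raw response finds nothing. The fix is to capture and analyze the network traffic, check whether the page requests JSON data, and then request that JSON URL directly to get the data.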
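One way to confirm this before touching the spider is to count how often the target class appears in the raw HTML. The snippet below is only a rough sketch using the standard library; `ClassCounter`, `count_class`, and both sample HTML strings are made up for illustration and are not the real Tencent page.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class(html, class_name):
    """Return how many elements in `html` carry `class_name`."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical raw response: an empty shell that a script fills in later.
RAW_HTML = (
    '<html><body><div id="app"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical DOM after the browser has run the script.
RENDERED_HTML = (
    '<html><body><div id="app">'
    '<div class="recruit-list"><a><h4>Engineer</h4></a></div>'
    '</div></body></html>'
)

print(count_class(RAW_HTML, "recruit-list"))       # what Scrapy would see
print(count_class(RENDERED_HTML, "recruit-list"))  # what the browser shows
```

In a real check you would pass `response.text` from `scrapy shell` instead of the sample strings. A count of zero there would suggest the listings are filled in by JavaScript after the page loads.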

疾风怪盗 posted on 2020-9-7 18:45:34

https://careers.tencent.com/tencentcareer/api/post/Query?timestamp=1599475401341&countryId=&cityId=&bgIds=&productId=&categoryId=&parentCategoryId=&attrId=&keyword=&pageIndex=1&pageSize=10&language=zh-cn&area=cn

Look for this request in the browser's Network tab. All of the job posting data is in its response, in JSON format.
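A possible starting point for using that endpoint is sketched below with the standard library only. The query parameters are copied from the URL above. The JSON field names (`Data`, `Posts`, `RecruitPostName`, `LocationName`, `CategoryName`, `LastUpdateTime`) are assumptions that do not appear in this thread, so check them against the actual response in the Network tab. `build_query_url` and `parse_posts` are hypothetical helper names.

```python
import time
from urllib.parse import urlencode

API_URL = "https://careers.tencent.com/tencentcareer/api/post/Query"


def build_query_url(page_index, page_size=10, timestamp_ms=None):
    """Build the JSON API URL for one page of results."""
    if timestamp_ms is None:
        # The captured URL carries a millisecond timestamp.
        timestamp_ms = int(time.time() * 1000)
    params = {
        "timestamp": timestamp_ms,
        "countryId": "",
        "cityId": "",
        "bgIds": "",
        "productId": "",
        "categoryId": "",
        "parentCategoryId": "",
        "attrId": "",
        "keyword": "",
        "pageIndex": page_index,
        "pageSize": page_size,
        "language": "zh-cn",
        "area": "cn",
    }
    return API_URL + "?" + urlencode(params)


def parse_posts(payload):
    """Yield item dicts from a decoded JSON payload.

    The key names below are assumed, not confirmed by the thread.
    """
    data = payload.get("Data") or {}
    for post in data.get("Posts") or []:
        yield {
            "positionName": post.get("RecruitPostName"),
            "workLocation": post.get("LocationName"),
            "positionType": post.get("CategoryName"),
            "publishTimes": post.get("LastUpdateTime"),
        }
```

Inside the spider, `start_urls` could hold `build_query_url(1)`, and `parse` could call `parse_posts(json.loads(response.text))` and yield each result. Paging then becomes a matter of incrementing `pageIndex` rather than the `index` parameter of the HTML page.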
