Scrapy spider returns no data, help needed

huangkang · posted 2019-10-14 07:09:36

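I'm learning Scrapy by crawling the top 10 entries of the four ranking lists at https://www.jobui.com/rank/company/ . If I only crawl the first page of jobs for each company, everything works, but as soon as I try to crawl the next page it breaks. Could someone take a look at the code below and tell me what is wrong? Thanks!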

import scrapy
import bs4
from ..items import JobuiItem

class JobuiSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domins = ['https://www.jobui.com']
    start_urls = ['https://www.jobui.com/rank/company/']

    def parse(self, response):
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        ul_list = bs.find_all('ul', class_="textList flsty cfix")
        for ul in ul_list:
            a_list = ul.find_all('a')
            for a in a_list:
                company_id = a['href']
                company_url = 'https://www.jobui.com' + company_id + 'jobs/'
                yield scrapy.Request(company_url, callback=self.parser_url_list())

    #========================================================
    # This block was not here before. Back then the line above read
    #   yield scrapy.Request(company_url, callback=self.parser_job)
    # and it worked fine: the first page of every company was crawled.
    # I added this block to crawl more pages, and now nothing is crawled.
    # I can't tell what went wrong.
    # Build the list of links
    def parser_url_list(self, response):
        while True:
            url_list = []
            bs = bs4.BeautifulSoup(response.text, 'html.parser')
            pages = bs.find('div', class_='pager cfix').find_all('a')
            flag = False  # whether this is the last page
            for page in pages:
                if page.text == '下一页':
                    url_list.append('https://www.jobui.com' + page['href'])
                    flag = True
            if flag == False:
                break
        yield scrapy.Request(url_list,callback=self.parse_job())
    #=========================================================

    # Parse the page
    def parse_job(self, response):
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        company = bs.find(id="companyH1").text
        datas = bs.find_all('div',class_='c-job-list')
        for data in datas:
            item = JobuiItem()
            item['company'] = company
            item['position'] = data.find('div',class_='job-segmetation').text.strip()
            item['address'] = data.find_all('span')[0].text.strip()
            item['detail'] = data.find_all('span')[1].text.strip()
            yield item

yuweb · posted 2019-10-14 08:54:42

Page 1: https://www.jobui.com/rank/company/view/foshan/?n=1
Page 2: https://www.jobui.com/rank/company/view/foshan/?n=2
...
The value of the n parameter is the page number.
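
To make that concrete, here is a minimal sketch of building the page URLs from the `n` parameter. The helper name `page_urls` and the page count of 3 are made up for illustration; the real number of pages has to be read from the site.

```python
from urllib.parse import urlencode


def page_urls(base_url, last_page, first_page=1):
    """Return the URLs for pages first_page..last_page using the ?n= parameter."""
    return [
        f"{base_url}?{urlencode({'n': n})}"
        for n in range(first_page, last_page + 1)
    ]


# Assumption: 3 pages, chosen only for the example.
urls = page_urls("https://www.jobui.com/rank/company/view/foshan/", 3)
for url in urls:
    print(url)
```

Each of these URLs could then be passed to `scrapy.Request` one at a time, since a request takes a single URL string.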

huangkang · posted 2019-10-14 11:22:05

yuweb posted 2019-10-14 08:54
Page 1: https://www.jobui.com/rank/company/view/foshan/?n=1
Page 2: https://www.jobui.com/rank/company/ ...

Oh, I see. I've found the page links now.
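
For anyone who hits the same symptom, three things in the spider from the first post look like likely causes, as far as can be told from the code alone. First, `callback=self.parser_url_list()` calls the method on the spot instead of handing Scrapy a reference, and since no `response` is passed the call raises a `TypeError`. Second, the `while True` loop re-parses the same response, so it never ends once a "下一页" link exists. Third, `scrapy.Request` expects one URL string, not a list. The sketch below is plain Python without Scrapy. The `next_page_url` helper and the sample hrefs are hypothetical and only illustrate the idea.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.jobui.com"  # site root used in the spider


def parse_job(response):
    """Stand-in for a Scrapy callback; it just echoes what it was given."""
    return f"parsed {response}"


# Passing a reference is what a callback needs: nothing runs yet.
callback = parse_job

# Calling it right away (what `callback=self.parse_job()` does) fails,
# because no response exists at that point.
try:
    parse_job()
except TypeError as exc:
    error = str(exc)


def next_page_url(links, base_url=BASE_URL):
    """Return the absolute URL of the '下一页' link, or None on the last page.

    `links` is a list of (text, href) pairs, e.g. collected from the
    <a> tags inside the pager div.
    """
    for text, href in links:
        if text.strip() == "下一页":
            return urljoin(base_url, href)
    return None


# Hypothetical hrefs, for illustration only.
sample_links = [("1", "/company/123/jobs/"), ("下一页", "/company/123/jobs/p2/")]
print(next_page_url(sample_links))
print(next_page_url([("1", "/company/123/jobs/")]))
```

In the spider itself this would probably mean that `parse_job` yields its items and then, if a next-page URL was found, yields `scrapy.Request(next_url, callback=self.parse_job)`, so each page schedules the one after it and the crawl stops on the last page. Separately, the attribute Scrapy reads is `allowed_domains`, and it takes bare domain names such as `jobui.com` rather than full URLs.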