1. I'm using Playwright to scrape Tencent's job listings from Liepin; below is the first half of my program.
Problem: the scraper handles the first two items in the listing, but on the third it raises a KeyError:
Traceback (most recent call last):
File "C:\Users\AAA\PycharmProjects\pythonProject\爬虫\猎聘\猎聘爬虫.py", line 163, in <module>
spiders_liepin(page_sta, page_end)
File "C:\Users\AAA\PycharmProjects\pythonProject\爬虫\猎聘\猎聘爬虫.py", line 69, in spiders_liepin
data_promid = a.find('a').attrs['data-promid']
KeyError: 'data-promid'
When I went back and inspected the page source, the attribute name looked correct, so why does this happen, and how do I fix it?
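For context: BeautifulSoup exposes a tag's attributes via `.attrs`, which is an ordinary dict, so indexing a key that a particular `<a>` tag doesn't carry raises exactly this KeyError. A likely cause (an assumption; verify against the live page) is that only promoted job cards carry `data-promid` while organic cards don't. A network-free sketch of the behavior, with made-up card data:

```python
# tag.attrs in BeautifulSoup is a plain dict, so a missing attribute
# raises KeyError when indexed with [] -- the same error as the traceback.
# Hypothetical card data: a promoted listing carries data-promid, an organic one doesn't.
promoted_attrs = {"href": "/job/123.html", "data-promid": "promid=abc"}
organic_attrs = {"href": "/job/456.html"}  # no data-promid key

print(promoted_attrs["data-promid"])     # indexing works when the key exists
print(organic_attrs.get("data-promid"))  # .get() returns None instead of raising

try:
    organic_attrs["data-promid"]         # this is what the scraper does
except KeyError as err:
    print("KeyError:", err)              # reproduces the reported error
```

So the attribute name in the code can be "correct" for most cards and still blow up on the first card that simply doesn't have it.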
# Load keywords ------------------------------------------------------------
for i in col:
    kw.append(i.value)
for k in kw:
    page1.goto('https://c.liepin.com/', wait_until='domcontentloaded')
    page1.wait_for_timeout(5000)
    page1.get_by_placeholder("搜索职位/公司/内容关键词").fill(k)
    page1.get_by_placeholder("搜索职位/公司/内容关键词").press("Enter")
    page1.wait_for_timeout(1000)
    # Check the keyword matched a company page ------------------------------
    if page1.get_by_role('button', name='进入公司主页').is_enabled():
        with page1.expect_popup() as popup_info:
            page1.get_by_role('button', name='进入公司主页').click()
        page2 = popup_info.value
        page2.wait_for_timeout(1000)
        p2 = BeautifulSoup(page2.content(), 'html.parser')
        joblist_link = p2.find('div', attrs={'class': 'company-header-content-tab clearfix'}).find('a').find_next('a').attrs['href']
        page2.goto(joblist_link)
        page2.wait_for_timeout(5000)
        # Pagination --------------------------------------------------------
        for page_num in range(page_sta, page_end):
            p2 = BeautifulSoup(page2.content(), 'html.parser')
            for a in p2.find_all('div', attrs={'class': 'job-detail-box'}):
                href = a.find('a').attrs['href']
                data_promid = a.find('a').attrs['data-promid']  # <-- KeyError raised here
                detail_page = href + '?' + data_promid
                # Extract detail-page info ----------------------------------
                page3.goto(detail_page)
                page3.wait_for_timeout(5000)
                d = BeautifulSoup(page3.content(), "html.parser")
                job_name = d.find('span', attrs={'class': 'name ellipsis-1'}).text
                job_comp = k
                job_salary = d.find('span', attrs={'class': 'salary'}).text
                detail = d.find('dd', attrs={'data-selector': 'job-intro-content'}).text
                detail_tag = d.find('div', attrs={'class': 'job-properties'}).find_all('span')
                detail_sum = d.find('div', attrs={'class': 'tag-box'}).text
                tag = []
                data = [job_name, job_comp, job_salary, detail_page, detail, detail_sum]
                for t in detail_tag:
                    tag.append(t.text)
                data.extend(tag)
                # Save the row ----------------------------------------------
                sheet.append(data)
                Data.append(data)
                book.save('./请将数据另存为,并清空此文件内容.xlsx')
                print('page ' + str(page_num), len(Data), job_name, job_comp, job_salary, detail_page, detail,
                      detail_tag, detail_sum)
            # Check for the last page ---------------------------------------
            if page2.get_by_role("button", name="right").is_enabled():
                page2.get_by_role("button", name="right").click()
                page2.wait_for_timeout(1000)
                continue
            else:
                print('page ' + str(page_num) + ': reached the last page')
                break
# Print summary when done ---------------------------------------------------
print('rows written:', len(Data))
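One way to make the loop tolerant is to replace the `[]` indexing with `.get()` and fall back when the attribute is absent. This rests on an assumption worth checking against the live page: that for cards without `data-promid`, the bare `href` is still a usable detail URL. A minimal sketch with a hypothetical `build_detail_url` helper (not from the original code):

```python
# Hypothetical helper: builds the detail-page URL from a job card anchor's
# .attrs dict, tolerating a missing data-promid attribute instead of raising.
def build_detail_url(anchor_attrs):
    href = anchor_attrs.get("href")
    if href is None:
        return None  # malformed card: caller should skip it
    data_promid = anchor_attrs.get("data-promid")
    if data_promid:
        return href + "?" + data_promid  # promoted card, same as the original code
    return href  # organic card: assume the plain href resolves (verify this!)

print(build_detail_url({"href": "/a/job/1.html", "data-promid": "promid=x"}))  # /a/job/1.html?promid=x
print(build_detail_url({"href": "/a/job/2.html"}))                             # /a/job/2.html
```

In the scraper, that would mean calling this on `a.find('a').attrs` and skipping the card (or logging it) when the result is None, rather than letting the whole run die on one listing.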