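Liepin scraper fails partway through its scraping loop
I'm using Playwright to scrape Tencent's job postings on Liepin; the first half of my program is below. Problem: the scraper handles the first two entries in the list, but on the third entry it raises a KeyError, as follows: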
Traceback (most recent call last):
  File "C:\Users\AAA\PycharmProjects\pythonProject\爬虫\猎聘\猎聘爬虫.py", line 163, in <module>
    spiders_liepin(page_sta, page_end)
  File "C:\Users\AAA\PycharmProjects\pythonProject\爬虫\猎聘\猎聘爬虫.py", line 69, in spiders_liepin
    data_promid = a.find('a').attrs['data-promid']
KeyError: 'data-promid'
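When I went back and inspected the page source, the key ('data-promid') was there and spelled correctly. Why does this happen, and how can I fix it?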
# Load the keywords ----------------------------------------
for i in col:
    kw.append(i.value)
for k in kw:
    page1.goto('https://c.liepin.com/',
               wait_until='domcontentloaded')
    page1.wait_for_timeout(5000)
    # The placeholder and button names below are the site's own UI strings
    page1.get_by_placeholder("搜索职位/公司/内容关键词").fill(k)
    page1.get_by_placeholder("搜索职位/公司/内容关键词").press("Enter")
    page1.wait_for_timeout(1000)
    # Check the keyword ----------------------------------------
    if page1.get_by_role('button', name='进入公司主页').is_enabled():
        # The company home page opens in a popup
        with page1.expect_popup() as popup_info:
            page1.get_by_role('button', name='进入公司主页').click()
        page2 = popup_info.value
        page2.wait_for_timeout(1000)
        p2 = BeautifulSoup(page2.content(), 'html.parser')
        # The second link in the company header tabs leads to the job list
        joblist_link = p2.find('div', attrs={'class': 'company-header-content-tab clearfix'}).find('a').find_next('a').attrs['href']
        page2.goto(joblist_link)
        page2.wait_for_timeout(5000)
        # Pagination ----------------------------------------
        for page_num in range(page_sta, page_end):
            p2 = BeautifulSoup(page2.content(), 'html.parser')
            for a in p2.find_all('div', attrs={'class': 'job-detail-box'}):
                href = a.find('a').attrs['href']
                # This is the line that raises KeyError on the third entry
                data_promid = a.find('a').attrs['data-promid']
                detail_page = href + '?' + data_promid
                # Extract the detail-page information ----------------------------------------
                page3.goto(detail_page)
                page3.wait_for_timeout(5000)
                d = BeautifulSoup(page3.content(), "html.parser")
                job_name = d.find('span', attrs={'class': 'name ellipsis-1'}).text
                job_comp = k
                job_salary = d.find('span', attrs={'class': 'salary'}).text
                detail = d.find('dd', attrs={'data-selector': 'job-intro-content'}).text
                detail_tag = d.find('div', attrs={'class': 'job-properties'}).find_all('span')
                detail_sum = d.find('div', attrs={'class': 'tag-box'}).text
                tag = []
                # NOTE: the right-hand side of the next assignment is missing in the original post
                data =
                for t in detail_tag:
                    tag.append(t.text)
                data.extend(tag)
                # Save the data ----------------------------------------
                sheet.append(data)
                Data.append(data)
                # The file name means "save the data under another name, then clear this file"
                book.save('./请将数据另存为,并清空此文件内容.xlsx')
                print('Page ' + str(page_num), len(Data), job_name, job_comp, job_salary, detail_page, detail,
                      detail_tag, detail_sum)
            # Check for the last page ----------------------------------------
            if page2.get_by_role("button", name="right").is_enabled():
                page2.get_by_role("button", name="right").click()
                page2.wait_for_timeout(1000)
                continue
            else:
                print('Page ' + str(page_num) + ': reached the last page ----------------------------------------')
                break
# Print the result once the run finishes ----------------------------------------
print('Rows written:', len(Data))

It really was the key that was wrong: the site uses different attribute names depending on whether you are logged in or not.
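
Following up on that answer: since the attribute can be absent depending on login state, one way to keep the loop from crashing is to read it with .get() instead of indexing. This is only a minimal sketch. The helper name build_detail_url and the example URLs are made up, plain dicts stand in for the tag.attrs mapping that BeautifulSoup exposes, and whether Liepin's detail page still loads without the data-promid parameter is an untested assumption.

```python
from typing import Mapping, Optional


def build_detail_url(attrs: Mapping[str, str]) -> Optional[str]:
    """Build a detail-page URL from a link's attribute mapping.

    Hypothetical helper: `attrs` is meant to be `a.find('a').attrs`.
    Returns None when there is no href at all, and appends data-promid
    only when that attribute is actually present.
    """
    href = attrs.get("href")
    if not href:
        return None
    # .get() returns None for a missing key instead of raising KeyError
    promid = attrs.get("data-promid")
    if promid is None:
        # Assumption: the bare href is still a usable URL
        return href
    separator = "&" if "?" in href else "?"
    return f"{href}{separator}{promid}"


# Made-up attribute mappings, standing in for two different list entries
with_promid = {"href": "https://example.com/job/1.shtml", "data-promid": "d_sfrom=x"}
without_promid = {"href": "https://example.com/job/2.shtml"}

print(build_detail_url(with_promid))
print(build_detail_url(without_promid))
print(build_detail_url({}))
```

In the loop from the question, the two attrs[...] lookups could then be replaced by one call to such a helper, skipping the entry with continue when it returns None. Printing sorted(a.find('a').attrs) for the failing entry would also show which attribute names the page really carries in the current login state.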