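I want to scrape BOSS直聘 (zhipin.com) with the playwright module.
Current problems:
1. When I run the code and log in to BOSS直聘, the home page keeps refreshing after login. The same thing used to happen when I scraped BOSS with 八爪鱼: partway through, it would get stuck refreshing on one page. Is there any way to fix this?
2. The code below was copied from a scraper I had already written for 智联招聘 and then modified, but when I run it on BOSS直聘 it can't read any data. Why is that?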
import time
from playwright.sync_api import sync_playwright
import playwright
from bs4 import BeautifulSoup
import json
import openpyxl

def spiders():
    # # Get cookies
    # browser = playwright.chromium.launch(headless=False)
    # context = browser.new_context()
    # page = context.new_page()
    # page.goto('https://login.zhipin.com/?ka=header-login')
    # page.get_by_role("link", name="验证码登录").click()
    # page.get_by_role("textbox", name="手机号").fill("<phone number>")
    # time.sleep(50)
    # # Save cookies
    # cookies = context.cookies()
    # with open('boss直聘cookies.json', 'w') as c:
    #     c.write(json.dumps(cookies))
    # context.close()
    # browser.close()
    # # Reopen the browser
    # # Load cookies
    with open('boss直聘cookies.json', 'r') as r:
        load_cookies = json.loads(r.read())
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    context.add_cookies(load_cookies)
    page1 = context.new_page()
    page2 = context.new_page()
    page1.goto('https://www.zhipin.com/web/geek/job?query=VR&city=101240100&page=1')
    time.sleep(5)
    # Pagination
    data = []
    for page_num in range(1,2):
        page1.goto(f'https://www.zhipin.com/web/geek/job?query=VR&city=101240100&page={page_num}')
        # Extract data, get the list
        job_list = BeautifulSoup(page1.content(), "html.parser")
        time.sleep(5)
        for i in job_list.find_all('ul', attrs={'class':'job-list-box'}):
            job_name = i.find('span', attrs={'class':'job-name'}).text
            job_area = i.find('span', attrs={'class':'class="job-area'}).text
            job_comp = i.find('span', attrs={'ka':"search_list_company_1_custompage"}).text
            job_salary = i.find('span', attrs={'class':'salary'}).text
            detail_page = i.find('a').attrs['href']
            # # Extract detail-page info
            # page2.goto(url=detail_page)
            # time.sleep(1)
            # detail_content = BeautifulSoup(page2.content(), "html.parser")
            # detail = detail_content.find('div', attrs={'class': 'describtion'}).text
            data.append([job_name, job_area,job_comp, job_salary, detail_page,])
            print(job_name, job_area,job_comp, job_salary, detail_page, )
    # Store data
    book = openpyxl.load_workbook('boss直聘.xlsx')
    sheet = book['boss']
    for row in data:
        sheet.append(row)
    print('Rows written:',len(data))
    book.save('boss直聘.xlsx')

with sync_playwright() as playwright:
    spiders()
This post was last edited by cflying on 2022-11-29 18:51
page1.goto('https://www.zhipin.com/web/geek/job?query=VR&city=101240100&page={}'.format(page_num))
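I took a quick look. The refreshing is probably the site's security verification.
As for why there is no data: the page has not finished loading when you move on to the next step. Either sleep for a few seconds, or wait for a specific element to appear before continuing. OP, keep digging into it.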
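The "wait for an element" option could look roughly like the sketch below. It is only an illustration: `load_rendered_html`, `main`, and the 15-second timeout are made-up names and values, and the `ul.job-list-box` selector is simply taken from the OP's code, so it may not match what the live site actually renders.

```python
def load_rendered_html(page, url, selector="ul.job-list-box", timeout_ms=15000):
    """Open url, wait for selector to become visible, then return the page HTML.

    `page` is a Playwright sync Page. The default selector comes from the
    original post and is an assumption about the site's markup.
    """
    page.goto(url)
    # Blocks until the element is visible; Playwright raises its TimeoutError
    # if nothing matches within timeout_ms.
    page.wait_for_selector(selector, timeout=timeout_ms)
    return page.content()


def main():
    # Imported here so the helper above can be used without launching a browser.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        page = browser.new_page()
        html = load_rendered_html(
            page,
            "https://www.zhipin.com/web/geek/job?query=VR&city=101240100&page=1",
        )
        print(len(html))
        browser.close()
```

Call `main()` to try it. The returned HTML can then be handed to BeautifulSoup as in the OP's code. If the wait times out, the list never rendered, which points back at the security verification rather than at the parsing.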
Oh, and remember to call close at the end.
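One possible way to make sure close always runs, even when the scraping code raises halfway, is a try/finally. This is a minimal sketch; `run_and_close` and `work` are hypothetical names, not part of Playwright.

```python
def run_and_close(browser, work):
    """Run work(context) and always close the context and browser afterwards."""
    context = browser.new_context()
    try:
        return work(context)
    finally:
        # Runs on success and on exceptions alike.
        context.close()
        browser.close()
```

Inside `with sync_playwright() as p:` this would be called as `run_and_close(p.chromium.launch(headless=False), scrape)`, where `scrape` is whatever function takes the context and does the actual work.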
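Finally, when you hit a problem, think through the logic of how the program runs and track down the cause step by step. In this case, if the page keeps refreshing, at least open the browser window and see which URL it is refreshing to. If no data comes back, at least have the code return the page source of the response and look at it. If the source has no content in it, there is nothing to scrape.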
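Both checks can be done from the script itself. The sketch below is just one way to do it: `log_navigations`, `dump_source`, the `debug_page.html` file name, and the `job-list-box` marker are all illustrative choices, with the marker borrowed from the OP's code.

```python
def log_navigations(page, seen):
    """Append every URL the main frame navigates to onto the list `seen`."""
    def on_navigated(frame):
        # Skip iframes; only the top-level page matters for a refresh loop.
        if frame.parent_frame is None:
            seen.append(frame.url)
    page.on("framenavigated", on_navigated)


def dump_source(page, path="debug_page.html", marker="job-list-box"):
    """Save the current page source to path and report whether marker is in it."""
    html = page.content()
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    return marker in html
```

Call `log_navigations(page1, seen)` before the first `goto`, then print `seen` after a few seconds to see where the refresh is going. If `dump_source(page1)` returns False, the job list is not in the HTML at all, so no selector will find it.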