鱼C论坛


[Solved] Scraping the download links for every resume template on a page only yields one template's address instead of all of them

Posted on 2020-12-22 03:45:06

import requests
from lxml import etree

if __name__ == "__main__":
    # UA spoofing
    headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
        }
    # Build the list of listing-page URLs
    list_url = []
    url_1 = "https://sc.chinaz.com/jianli/free.html"
    list_url.append(url_1)
    for i in range(2, 3):
        url_n = "https://sc.chinaz.com/jianli/free_{}.html".format(i)
        list_url.append(url_n)

    for url in list_url:
        page_text = requests.get(url=url, headers=headers).text
        tree = etree.HTML(page_text)
        jianli_div = tree.xpath("//div[@id='main']/div/div")
        for a in jianli_div:
            jianli_web = 'https:' + a.xpath("./a/@href")[0]

    # Persist each resume template
        jianli_text = requests.get(url=jianli_web, headers=headers).text
        tree = etree.HTML(jianli_text)

        download = tree.xpath("//ul[@class='clearfix']/li")
        for a in download:
            download_web = a.xpath("./a/@href")[0]
            print(download_web)







The code is above. Running it gives:
/Users/hudandan/PycharmProjects/爬数据/venv/bin/python /Users/hudandan/PycharmProjects/爬数据/简历模块爬取.py
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14213.rar
/jianli/201218292371.htm
/jianli/201217216600.htm
/jianli/201217576170.htm
/jianli/201216012640.htm
//sc.chinaz.com/zt/hanyi/dabao.html
//font.chinaz.com/171108141211.htm
//font.chinaz.com/171108123820.htm
//sc.chinaz.com/ppt/zongjie.html
//sc.chinaz.com/ppt/jihuashu.html
//sc.chinaz.com/jianli/hushi.html
//sc.chinaz.com/jianli/tongyong.html
//sc.chinaz.com/jiaoben/huandengpian.html
//sc.chinaz.com/jiaoben/caidanhaohang.html
//sc.chinaz.com/jiaoben/jiaodiantu.html
//sc.chinaz.com/psd/mingpianmoban.html
//sc.chinaz.com/tupian/siwameinvtupian.html
//sc.chinaz.com/tupian/rentiyishu.html
//sc.chinaz.com/ppt/shangwupptmoban.html
//font.chinaz.com/heitiziti.html
//font.chinaz.com/keaiziti.html
//sc.chinaz.com/yinxiao/RenWuPeiYin.html
//sc.chinaz.com/yinxiao/lingshengxiazai.html
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
https://downsc.chinaz.net/Files/ ... 012/jianli14193.rar
/jianli/201214415200.htm
/jianli/201213443620.htm
/jianli/201210283381.htm
/jianli/201209238730.htm
//sc.chinaz.com/zt/hanyi/dabao.html
//font.chinaz.com/171108141211.htm
//font.chinaz.com/171108123820.htm
//sc.chinaz.com/ppt/zongjie.html
//sc.chinaz.com/ppt/jihuashu.html
//sc.chinaz.com/jianli/hushi.html
//sc.chinaz.com/jianli/tongyong.html
//sc.chinaz.com/jiaoben/huandengpian.html
//sc.chinaz.com/jiaoben/caidanhaohang.html
//sc.chinaz.com/jiaoben/jiaodiantu.html
//sc.chinaz.com/psd/mingpianmoban.html
//sc.chinaz.com/tupian/siwameinvtupian.html
//sc.chinaz.com/tupian/rentiyishu.html
//sc.chinaz.com/ppt/shangwupptmoban.html
//font.chinaz.com/heitiziti.html
//font.chinaz.com/keaiziti.html
//sc.chinaz.com/yinxiao/RenWuPeiYin.html
//sc.chinaz.com/yinxiao/lingshengxiazai.html

It only scraped the download links for the first resume template on page 1 and page 2; none of the other templates were scraped, and I can't see where the mistake is.
Posted on 2020-12-22 13:19:11 | Best answer
Last edited by suchocolate on 2020-12-22 13:26

import requests
from lxml import etree
import os


def main():
    folder = 'docs'
    if not os.path.exists(folder):
        os.mkdir(folder)
    os.chdir(folder)
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"}
    result = []
    url = "https://sc.chinaz.com/jianli/free.html"
    num = int(input('How many pages to scrape: '))
    # Collect the detail-page link of every template on each listing page,
    # following the "next page" link to move to the next listing page.
    for i in range(num):
        r = requests.get(url, headers=headers)
        html = etree.HTML(r.text)
        temp = html.xpath('//div[@id="main"]/div/div/a/@href')
        result.extend(temp)
        nx_url = html.xpath('//a[@class="nextpage"]/@href')[0]
        url = 'https://sc.chinaz.com/jianli/' + nx_url
    # print(result)
    file_counter = 1
    # Visit each template's detail page and download the first mirror link.
    for i in result:
        url = 'https:' + i
        r = requests.get(url, headers=headers)
        html = etree.HTML(r.text)
        doc = html.xpath('//ul[@class="clearfix"]/li[1]/a/@href')[0]
        doc_name = doc.split('/')[-1]
        r = requests.get(doc, headers=headers)
        with open(doc_name, 'wb') as f:
            f.write(r.content)
            print(f'Downloaded {doc_name}; {file_counter} resumes downloaded so far.')
            file_counter += 1


if __name__ == "__main__":
    main()
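A side note on this answer (not part of the original post): it follows each listing page's "nextpage" link instead of formatting the free_{n}.html URLs, and it saves only the first mirror (li[1]) on each detail page. The question's printed output shows that //ul[@class='clearfix']/li also matches related-template and navigation lists, so if you wanted every mirror the way the original script printed them, one hedged variation is to collect all the hrefs and keep just the .rar links. The helper name rar_links below is hypothetical, and it assumes the page structure shown above:

import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0"}

def rar_links(detail_url):
    # Return every .rar mirror link on a template's detail page.
    # The .rar filter drops the navigation/related links that the same
    # ul.clearfix XPath also matches, as seen in the question's output.
    r = requests.get(detail_url, headers=headers)
    html = etree.HTML(r.text)
    hrefs = html.xpath('//ul[@class="clearfix"]/li/a/@href')
    return [h for h in hrefs if h.endswith('.rar')]

# Usage: pass a detail-page URL built from a listing-page href,
# e.g. print(rar_links('https:' + href))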
OP | Posted on 2020-12-23 18:01:02

Let me go through your code carefully now. Thanks!
OP | Posted on 2020-12-24 02:55:17

I went back over my code and the logic itself is fine; the problem is just indentation. Everything from the jianli_text request onward needs one more level of indentation so it runs inside the inner loop, once per template rather than once per page (otherwise only the last value assigned to jianli_web ever gets fetched). Thanks for the code!
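For reference, a minimal sketch of that indentation fix, keeping the original script's variable names and XPath expressions (assuming the same page structure):

import requests
from lxml import etree

if __name__ == "__main__":
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36"
    }
    list_url = ["https://sc.chinaz.com/jianli/free.html"]
    for i in range(2, 3):
        list_url.append("https://sc.chinaz.com/jianli/free_{}.html".format(i))

    for url in list_url:
        page_text = requests.get(url=url, headers=headers).text
        tree = etree.HTML(page_text)
        jianli_div = tree.xpath("//div[@id='main']/div/div")
        for a in jianli_div:
            jianli_web = 'https:' + a.xpath("./a/@href")[0]
            # This block was previously dedented one level, so it only ever
            # used the last jianli_web assigned on each listing page.
            jianli_text = requests.get(url=jianli_web, headers=headers).text
            detail = etree.HTML(jianli_text)
            for li in detail.xpath("//ul[@class='clearfix']/li"):
                print(li.xpath("./a/@href")[0])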
