A JSON problem in a web crawler

slhlde · posted on 2018-9-7 23:01:31

import pymongo
import json
import time
import requests

# Scraped postings are stored in the local MongoDB collection mydb.lagou
client=pymongo.MongoClient('localhost',27017)
mydb=client['mydb']
lagou=mydb['lagou']

headers={
    'Cookie':"xxxxx",
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Connection':'keep-alive'
}

def get_page(url,params):
    # Request the first page to find out how many results exist
    html=requests.get(url,data=params,headers=headers)
    json_data=json.loads(html.text)
    total_count=json_data['content']['positionResult']['totalCount']
    # The script assumes 15 results per page and reads at most 30 pages
    page_number=int(total_count/15) if int(total_count/15) <30 else 30
    get_info(url,page_number)

def get_info(url,page):
    # Walk through the result pages and save every posting
    for pn in range(1,page+1):
        params={
            'first':'true',
            'pn':str(pn),
            'kd':'Python'
        }
        try:
            html=requests.get(url,data=params,headers=headers)
            print(html.text)
            json_data=json.loads(html.text)
            results=json_data['content']['positionResult']['result']
            for result in results:
                infos={
                    'businessZones': result['businessZones'],
                    'city': result['city'],
                    'companyFullName': result['companyFullName'],
                    'companyLabelList': result['companyLabelList'],
                    'companySize': result['companySize'],
                    'district': result['district'],
                    'education': result['education'],
                    'explain': result['explain'],
                    'financeStage': result['financeStage'],
                    'firstType': result['firstType'],
                    'formatCreateTime': result['formatCreateTime'],
                    'gradeDescription': result['gradeDescription'],
                    'imState': result['imState'],
                    'industryField': result['industryField'],
                    'jobNature': result['jobNature'],
                    'positionAdvantage': result['positionAdvantage'],
                    'salary': result['salary'],
                    'secondType': result['secondType'],
                    'workYear': result['workYear']
                }
                lagou.insert_one(infos)
                time.sleep(2)
        except requests.exceptions.ConnectionError:
            pass

if __name__=='__main__':
    url='https://www.lagou.com/jobs/positionAjax.json'
    params={
        'first':'true',
        'pn':'1',
        'kd':'Python'
    }
    get_page(url,params)
The error is: KeyError: 'content'
But when I looked at the response, the content key was there.
Could someone please take a look?
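
A side note on the page calculation in the script above: int(total_count/15) rounds down, so a final, partly filled page would be skipped. The sketch below rounds up instead. The page size of 15 and the cap of 30 pages are taken from the script; the function name page_count is made up for this illustration.

```python
def page_count(total_count, per_page=15, max_pages=30):
    """Number of pages to request, rounding up and capping at max_pages."""
    pages = (total_count + per_page - 1) // per_page  # ceiling division
    return min(pages, max_pages)


# A few sample totals to show the rounding and the cap.
for total in (0, 15, 16, 450, 1000):
    print(total, page_count(total))
```

Whether the site really serves 15 results per page and stops at 30 pages is an assumption carried over from the original script.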

拉了盏灯 · posted on 2018-9-7 23:08:47

The server says you are making requests too frequently...
The response really has no content key.
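
One way to make this kind of failure easier to read is to check the parsed JSON before indexing into it. The sketch below is only an illustration: the helper and exception names are invented, and it assumes that a rejected request carries its explanation in a msg field, as the response printed later in this thread does.

```python
class UnexpectedResponseError(Exception):
    """Raised when the payload carries an error message instead of job data."""


def extract_position_result(json_data):
    """Return json_data['content']['positionResult'], or raise a readable error."""
    if 'content' not in json_data:
        # Assumption: 'msg' holds the server's explanation for the rejection.
        raise UnexpectedResponseError(
            json_data.get('msg', 'response has no content key'))
    return json_data['content']['positionResult']


# Hypothetical payloads, for illustration only.
blocked = {'success': False, 'msg': 'too many requests'}
normal = {'content': {'positionResult': {'totalCount': 45, 'result': []}}}

print(extract_position_result(normal)['totalCount'])
try:
    extract_position_result(blocked)
except UnexpectedResponseError as err:
    print('request rejected:', err)
```

With a check like this, the script stops with the server's own message rather than a bare KeyError.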

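slhlde · posted on 2018-9-7 23:11:33

拉了盏灯 posted on 2018-9-7 23:08
The server says you are making requests too frequently...
The response really has no content key.

I have already added a cookie and headers, and I am still being limited??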
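
Whether a cookie actually reaches the server depends on how it is attached. As a raw request header it has to be named Cookie; alternatively, requests accepts a dict through its cookies argument. The sketch below turns a cookie string copied from a browser into such a dict. The helper name and the cookie names are invented for the example, and a valid cookie alone may still not lift a rate limit.

```python
def cookie_string_to_dict(raw):
    """Split a 'name=value; name2=value2' string into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        name, sep, value = pair.strip().partition('=')
        if sep:  # skip fragments that contain no '='
            cookies[name] = value
    return cookies


# Hypothetical cookie string, for illustration only.
raw_cookie = 'session_id=abc123; token=xyz'
cookies = cookie_string_to_dict(raw_cookie)
print(cookies)

# With the script from this thread, the call would then look like:
# requests.get(url, data=params, headers=headers, cookies=cookies)
```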

拉了盏灯 · posted on 2018-9-7 23:12:09

slhlde posted on 2018-9-7 23:11
I have already added a cookie and headers, and I am still being limited??

Try a different IP.
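
With requests, sending traffic from another IP usually means going through a proxy, which is passed in the proxies argument. The sketch below only builds that mapping. The helper name is invented, the address is a placeholder from a range reserved for documentation, and it assumes an HTTP proxy that needs no authentication.

```python
def build_proxies(proxy_address):
    """Build the mapping that requests accepts through its proxies argument."""
    proxy_url = 'http://' + proxy_address
    # The same proxy is used for both plain and TLS traffic here.
    return {'http': proxy_url, 'https': proxy_url}


# 203.0.113.10:8080 is a placeholder, not a working proxy.
proxies = build_proxies('203.0.113.10:8080')
print(proxies)

# With the script from this thread, the call would then look like:
# requests.get(url, data=params, headers=headers, proxies=proxies, timeout=10)
```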

wongyusing · posted on 2018-9-7 23:17:15

Last edited by wongyusing on 2018-9-7 23:18

It fails because you did not receive the data you expected. Look at this:

#import pymongo
import json
import time
import requests
#client=pymongo.MongoClient('localhost',27017)
#mydb=client['mydb']
#lagou=mydb['lagou']
headers={
    'Cookie':"xxxxx",
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36',
    'Connection':'keep-alive'
}
def get_page(url,params):
    html=requests.get(url,data=params,headers=headers)
    json_data=json.loads(html.text)
    # Print the data that was actually received
    print(json_data)
    total_count=json_data['content']['positionResult']['totalCount']
    page_number=int(total_count/15) if int(total_count/15) <30 else 30
    get_info(url,page_number)
def get_info(url,page):
    for pn in range(1,page+1):
        params={
            'first':'true',
            'pn':str(pn),
            'kd':'Python'
        }
        try:
            html=requests.get(url,data=params,headers=headers)
            print(html.text)
            json_data=json.loads(html.text)
            results=json_data['content']['positionResult']['result']
            for result in results:
                infos={
                    'businessZones': result['businessZones'],
                    'city': result['city'],
                    'companyFullName': result['companyFullName'],
                    'companyLabelList': result['companyLabelList'],
                    'companySize': result['companySize'],
                    'district': result['district'],
                    'education': result['education'],
                    'explain': result['explain'],
                    'financeStage': result['financeStage'],
                    'firstType': result['firstType'],
                    'formatCreateTime': result['formatCreateTime'],
                    'gradeDescription': result['gradeDescription'],
                    'imState': result['imState'],
                    'industryField': result['industryField'],
                    'jobNature': result['jobNature'],
                    'positionAdvantage': result['positionAdvantage'],
                    'salary': result['salary'],
                    'secondType': result['secondType'],
                    'workYear': result['workYear']
                }
                # Print each record instead of writing it to MongoDB
                print(infos)
                #lagou.insert_one(infos)
                time.sleep(2)
        except requests.exceptions.ConnectionError:
            pass
if __name__=='__main__':
    url='https://www.lagou.com/jobs/positionAjax.json'
    params={
        'first':'true',
        'pn':'1',
        'kd':'Python'
    }
    get_page(url,params)

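Print the data you received and you get the following (the msg field says the requests are too frequent and to try again later):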

(py_web) ➜ amazon python spider.py
{'success': False, 'msg': '您操作太频繁,请稍后再访问', 'clientIp': '218.15.235.105'}
Traceback (most recent call last):
  File "spider.py", line 72, in <module>
    get_page(url,params)
  File "spider.py", line 21, in get_page
    total_count=json_data['content']['positionResult']['totalCount']
KeyError: 'content'
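
Since the server answered with a rate-limit message rather than job data, one possible reaction is to wait and retry with growing pauses. The sketch below is not what anyone in this thread ran: fetch_with_backoff is an invented helper, the delays are arbitrary, and the fetch argument stands for any zero-argument function that returns the parsed JSON. The demo feeds it canned payloads so that it runs without network access.

```python
import time


def fetch_with_backoff(fetch, max_attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch() until the payload has a 'content' key or attempts run out."""
    for attempt in range(max_attempts):
        payload = fetch()
        if 'content' in payload:
            return payload
        if attempt < max_attempts - 1:
            # Double the pause after every rejected attempt.
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"no content after {max_attempts} attempts: {payload.get('msg')}")


# Canned payloads standing in for real responses (illustration only).
responses = iter([
    {'success': False, 'msg': 'too frequent'},
    {'success': False, 'msg': 'too frequent'},
    {'content': {'positionResult': {'totalCount': 3}}},
])
delays = []  # records the pauses instead of really sleeping
payload = fetch_with_backoff(lambda: next(responses), sleep=delays.append)
print(delays, payload['content']['positionResult']['totalCount'])
```

In the real script, fetch could wrap the requests.get call and json.loads. Whether waiting is enough depends on how the site applies its limit, which this thread does not establish.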

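slhlde · posted on 2018-9-8 13:23:58

wongyusing posted on 2018-9-7 23:17
It fails because you did not receive the data you expected. Look at this:

Print the data you received and you get the following

Thank you. I have read your blog; please keep it up. I am learning from it.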

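wongyusing · posted on 2018-9-8 15:51:32

slhlde posted on 2018-9-8 13:23
Thank you. I have read your blog; please keep it up. I am learning from it.

Oh, I just have not uploaded the rest yet.
I still need to revise some of the steps. If you need it, it is on GitHub; I have written up to the tag classification section.
There are project files and a GitBook.
If you need a PDF, I can generate one for you.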

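slhlde · posted on 2018-9-8 17:35:50

wongyusing posted on 2018-9-8 15:51
Oh, I just have not uploaded the rest yet.
I still need to revise some of the steps. If you need it, it is on GitHub; I have written up to the tag classification section.
There are project files and a gi ...

A PDF would be even better. Could you send it to my email?? slhlde@163.com
Thank you.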

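wongyusing · posted on 2018-9-8 18:11:30

slhlde posted on 2018-9-8 17:35
A PDF would be even better. Could you send it to my email??
Thank you.

The PDF has been generated, but the code gets split across pages,
which makes it hard to read. I will sort it out after dinner.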

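wongyusing · posted on 2018-9-8 23:01:37

wongyusing posted on 2018-9-8 18:11
The PDF has been generated, but the code gets split across pages,
which makes it hard to read. I will sort it out after dinner.

Download this GitHub repository.
See its introduction for how to use it.
It is best to browse the tutorials
through a server started with Node.js.
You can also read the PDF, but I do not recommend it.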
