When I ran this myself, I also added proxy IPs.
import requests            # HTTP request module
import time                # timing module
import random              # random module
from bs4 import BeautifulSoup

json_url = 'https://www.zhihu.com/hot'
cookies = 'paste your login cookie string here'

# Build a RequestsCookieJar to hold the cookie information
cookies_jar = requests.cookies.RequestsCookieJar()
for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    cookies_jar.set(key.strip(), value)  # store each cookie in the jar

class Crawl:
    def __init__(self):
        # Request headers: browser User-Agent plus Referer
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36',
            'Referer': 'https://www.zhihu.com/',
        }

    def get_json(self, json_url):
        response = requests.get(json_url, headers=self.headers, cookies=cookies_jar)
        soup = BeautifulSoup(response.text, 'lxml')
        titles = soup.find_all('h2', {'class': 'HotItem-title'})
        for i in titles:
            item = {}
            item['title'] = i.get_text()  # extract each hot-list title
            print(item)

if __name__ == '__main__':
    c = Crawl()                        # create the crawler object
    c.get_json(json_url)
    time.sleep(random.randint(2, 4))   # random delay between requests
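On the proxy IPs mentioned above: requests takes a proxies= dictionary, so one common pattern is to pick a random entry from a pool on each request. This is a minimal sketch under the assumption that you have your own working proxy addresses; the addresses and the pick_proxies helper below are hypothetical placeholders, not part of the original post.

```python
import random

# Hypothetical proxy pool -- replace these with live proxies you control.
PROXY_POOL = [
    'http://127.0.0.1:8080',
    'http://127.0.0.1:8081',
]

def pick_proxies():
    """Pick one proxy at random and format it for requests' proxies= argument."""
    addr = random.choice(PROXY_POOL)
    return {'http': addr, 'https': addr}

proxies = pick_proxies()
print(proxies)
```

A request would then pass it through, e.g. requests.get(json_url, headers=self.headers, cookies=cookies_jar, proxies=pick_proxies(), timeout=10), so each call may go out through a different address.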