Happy New Year! A beginner asking for help: how can I scrape the school listings from the 掌上高考 website?
First of all, happy New Year to everyone, and may the year bring you plenty of good fortune. I'm a complete Python beginner. What I need to scrape is the full list of schools that offer the 汉语言文学 major on the 掌上高考 website. I tried soup.find_all("div",class_="school-tab_schoolName__uLwTK"), but it returns an empty result. Inspecting the page elements, the school names sit inside many nested layers of div classes, so a plain get never reaches them. Posts online suggested lxml or CSS selectors, so I borrowed some similar lxml/etree code and adapted parts of it, but after a long time of running and fiddling the result is still empty. The code is below. Any help from the forum would be much appreciated.
import requests
from bs4 import BeautifulSoup
from lxml import etree
url = "”掌上高考网站 “中“汉语言文学”专业的所有开设院校的链接"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400 '}
response = requests.get(url,headers=headers)
html = response.content.decode('utf-8')
#print( html)
soup = BeautifulSoup(response.text, "html.parser")
#print(soup)
#print(html )
_element = etree.HTML(html)
# Note: etree.parse() expects a file path rather than an HTML string, so it is not used here.
div_nodes = _element.xpath('//div')
ent_text = "".join(etree.tostring(div, method='text', encoding='utf-8').decode() for div in div_nodes)
ent_text = ent_text.replace(" ", "").strip()
print(f"最后结果:{ent_text}")
'''target_div = soup.select("div#target-div")
'''
'''
data = []
for div in soup.find_all("div", class_="school-tab_schoolName__uLwTK"):
    print(div)
    content = div.text.strip()
    data.append(content)
'''
Happy New Year! Let me help you work through this problem.
Looking at the code you posted, there are a few things to change. First, the HTML content you get from the page should be `response.text` rather than `response.content.decode('utf-8')`. Second, when using the lxml parser, pass `etree.HTML(response.text)` rather than `etree.HTML(html)`. You can also try a CSS selector to locate the target elements.
Here is the revised code example:
import requests
from bs4 import BeautifulSoup
url = "掌上高考网站中汉语言文学专业的所有开设院校的链接"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400 '}
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# Use a CSS selector to find the target elements
data = []
for div in soup.select("div.school-tab_schoolName__uLwTK"):
    content = div.text.strip()
    data.append(content)
print("最终结果:")
for item in data:
    print(item)
Hopefully this helps you pull the school information from the target page. Good luck!
The reply above is from the FishC AI assistant. If it did not correctly answer your question, please keep asking.
With the code above, the run still gives an empty result... I tweaked it again and there is still nothing.
wingplay posted on 2024-02-17 13:28
With the code above, the run still gives an empty result... I tweaked it again and there is still no ...
Sorry that you are still stuck. A few things could cause you to get no data. First, make sure the page link you are requesting is correct and that your request headers are valid. Second, try `soup.prettify()` to inspect the HTML that BeautifulSoup actually parsed, and confirm that you can really find the target element in it. Finally, check that the CSS selector for the target element is correct.
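A quick way to run that check (a minimal sketch; the url value is a placeholder for the real page address and the headers dict is just a reasonable default): if the class name never shows up in the raw HTML, the data is being injected by JavaScript and requests plus BeautifulSoup alone will never see it.

import requests
from bs4 import BeautifulSoup

url = "..."      # placeholder: the 掌上高考 page you are scraping
headers = {"User-Agent": "Mozilla/5.0"}   # any reasonable browser User-Agent

html = requests.get(url, headers=headers).text
soup = BeautifulSoup(html, "html.parser")

print(soup.prettify()[:2000])            # eyeball what BeautifulSoup actually parsed
print("school-tab_schoolName" in html)   # False means the class name is not in the raw HTML at all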
Also, from your code and description, I noticed that when using the lxml parser you converted `_element` to text via `.text`, which will not give you the element contents correctly. Try `.xpath('string()')` instead of `.text`. Here is a revised example:
import requests
from bs4 import BeautifulSoup
from lxml import etree
url = "掌上高考网站中汉语言文学专业的所有开设院校的链接"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400'}
response = requests.get(url, headers=headers)
html = response.text
soup = BeautifulSoup(html, "html.parser")
# Use a CSS selector to find the target elements
data = []
for div in soup.select("div.school-tab_schoolName__uLwTK"):
    content = div.text.strip()
    data.append(content)
print("最终结果:")
for item in data:
    print(item)
If you still cannot get any data, please share more details about the structure of the target page and the content you want to grab, and I can help further. Good luck!
The reply above is from the FishC AI assistant. If it did not correctly answer your question, please keep asking.
The page is loaded dynamically, get it? If not, no matter: as long as speed isn't a concern, just use selenium.
阿奇_o posted on 2024-2-17 19:08
The page is loaded dynamically, get it? If not, no matter: as long as speed isn't a concern, just use selenium.
I'm a total beginner and really don't get it. Speed doesn't matter to me at all. Is there a similar selenium code snippet I could use as a reference?
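For reference, a minimal Selenium sketch (an untested, assumption-laden example: the page URL is a placeholder, the CSS class name is the one from the original post and may well have changed, and Chrome with a matching driver is assumed to be available):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "..."   # placeholder: the 掌上高考 page listing schools for the major

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")   # remove this line if you want to watch the browser
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    # Wait until the dynamically rendered school-name divs appear (class name taken from the original post)
    wait = WebDriverWait(driver, 15)
    elems = wait.until(EC.presence_of_all_elements_located(
        (By.CSS_SELECTOR, "div.school-tab_schoolName__uLwTK")))
    for e in elems:
        print(e.text.strip())
finally:
    driver.quit()

This only grabs the schools on the page that is currently rendered; each click on the next-page button would need its own wait, which is why the request-based approach in the next reply ends up being much faster.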
This site requires you to reverse-engineer the signsafe parameter carried in each request; the other parameters can stay fixed.
When the page (the official front-end URL) first loads, the data does not come along with it.
The data is fetched dynamically through an ajax request (a separate URL), and that request goes through some verification steps which the server checks.
By analysing how the key parameter is computed and translating that logic into Python, the rest becomes ordinary Python crawler code.
The crux is working out how signsafe is produced; everything else is routine.
bs only works when the data is already present in the page source; here the data is loaded dynamically, so it does not apply.
What I found while testing:
1. The request still returns data even without the signsafe field, so the site may not strictly validate it, but I would still recommend sending the same parameters a normal browser request carries.
2. The signsafe the site generates is not of fixed length, so there may be more to dig into here.
That is the general idea; you can explore it further from here.
Disclaimer: this script is for learning purposes only and must not be used for anything illegal.
import base64
from hashlib import sha1
from hmac import new as hmac_new
from urllib.parse import unquote
import requests
import time
# How signsafe is computed: HMAC-SHA1 over the URL-decoded request string,
# keyed with the fixed SIGN value, then base64-encoded.
def v(t):
    n = t['SIGN']
    t = t['str']
    t = unquote(t)
    n = hmac_new(n.encode(), t.encode(), sha1).digest()
    n = base64.b64encode(n).decode()
    return n
session = requests.session()
session.headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

for i in range(1, 6):
    # Fetch the first 5 pages.
    url_1 = f"https://api.zjzw.cn/web/api/?is_single=2&local_province_id=12&page={i}&province_id=&request_type=1&size=10&special_id=40&top_school_id=2498&type=&uri=apidata/api/gk/special/school"
    # Compute signsafe; each request carries one, otherwise the data cannot be fetched.
    t = {
        'SIGN': "D23ABC@#56",
        'str': url_1
    }
    signsafe = v(t)
    print("==========")
    # Build the request url and the POST parameters.
    url_full = f"{url_1}&signsafe={signsafe}"
    url_params = {
        "is_single": "2",
        "local_province_id": "12",
        "page": i,
        "province_id": "",
        "request_type": "1",
        "signsafe": signsafe,
        "size": "10",
        "special_id": "40",
        "top_school_id": "2498",
        "type": "",
        "uri": "apidata/api/gk/special/school",
    }
    # Send the request and fetch the data.
    resp = session.post(url_full, params=url_params)
    # The response is JSON.
    res = resp.json()
    for data in res['data']['item']:
        # Pull out whatever fields you need here.
        print(f"学校:{data['name']}\t城市:{data['city_name']}")
    # Be nice to the server (and to yourself).
    time.sleep(2)
The output:
==========
学校:成都锦城学院 城市:成都市
学校:北京师范大学 城市:北京市
学校:南京大学 城市:南京市
学校:北京大学 城市:北京市
学校:复旦大学 城市:上海市
学校:武汉大学 城市:武汉市
学校:四川大学 城市:成都市
学校:华东师范大学 城市:上海市
学校:浙江大学 城市:杭州市
学校:中国人民大学 城市:北京市
==========
学校:陕西师范大学 城市:西安市
学校:中山大学 城市:广州市
学校:山东大学 城市:济南市
学校:南开大学 城市:天津市
学校:暨南大学 城市:广州市
学校:南京师范大学 城市:南京市
学校:吉林大学 城市:长春市
学校:苏州大学 城市:苏州市
学校:华中师范大学 城市:武汉市
学校:首都师范大学 城市:北京市
==========
学校:福建师范大学 城市:福州市
学校:上海师范大学 城市:上海市
学校:北京语言大学 城市:北京市
学校:浙江师范大学 城市:金华市
学校:西南大学 城市:重庆市
学校:湖南师范大学 城市:长沙市
学校:上海大学 城市:上海市
学校:东北师范大学 城市:长春市
学校:西北大学 城市:西安市
学校:华南师范大学 城市:广州市
==========
学校:中央民族大学 城市:北京市
学校:上海交通大学 城市:上海市
学校:江苏师范大学 城市:徐州市
学校:扬州大学 城市:扬州市
学校:山东师范大学 城市:济南市
学校:河南大学 城市:开封市
学校:四川师范大学 城市:成都市
学校:兰州大学 城市:兰州市
学校:云南大学 城市:昆明市
学校:华中科技大学 城市:武汉市
==========
学校:河北大学 城市:保定市
学校:北京外国语大学 城市:北京市
学校:郑州大学 城市:郑州市
学校:广西师范大学 城市:桂林市
学校:黑龙江大学 城市:哈尔滨市
学校:安徽大学 城市:合肥市
学校:厦门大学 城市:厦门市
学校:中国传媒大学 城市:北京市
学校:内蒙古大学 城市:呼和浩特市
学校:南昌大学 城市:南昌市
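If you want to keep the results rather than just print them, a small hedged variant: collect each row inside the paging loop above and write a CSV at the end (the field names are the ones used in the print statement; the sample row below is only a placeholder copied from the output).

import csv

# rows would be filled inside the paging loop, e.g.:
#     rows.append({"name": data['name'], "city": data['city_name']})
rows = [{"name": "北京师范大学", "city": "北京市"}]   # placeholder sample row

with open("schools.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "city"])
    writer.writeheader()
    writer.writerows(rows)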
@Moderators: I am not sure whether replying like this is allowed. If it breaks any rules, please contact me so I can fix it, or feel free to correct it directly.
allen-xy posted on 2024-2-19 18:19
This site requires you to reverse-engineer the signsafe parameter carried in the request; the other parameters can stay fixed.
When the page (the official front-end URL) first loads ...
非常感谢!请问您给的代码中url_1 = f"...." 这段的原网页地址是什么?我换了个page=1以后打开是这样的{"code":"0000","message":"成功---success","data":{"item":[{"admissions":"2","central":"2","city_name":"成都市","department":"2","doublehigh":"0","dual_class":"38003","dual_class_name":"","f211":2,"f985":2,"id":"gkspecialschool40/2498","is_top":1,"level_name":"普通本科","name":"成都锦城学院","nature_name":"民办","province_name":"四川","ruanke_level":"","ruanke_rank":"","school_id":2498,"tag_name":"","type_name":"综合类","xueke_rank":9999,"xueke_rank_score":""},{"admissions":"1","central":"2","city_name":"北京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/52","is_top":2,"level_name":"普通本科","name":"北京师范大学","nature_name":"公办","province_name":"北京","ruanke_level":"A+","ruanke_rank":"1","school_id":52,"tag_name":"教育部直属","type_name":"师范类","xueke_rank":"2","xueke_rank_score":"A+"},{"admissions":"1","central":"2","city_name":"南京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/111","is_top":2,"level_name":"普通本科","name":"南京大学","nature_name":"公办","province_name":"江苏","ruanke_level":"A+","ruanke_rank":"2","school_id":111,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"5","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"北京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/31","is_top":2,"level_name":"普通本科","name":"北京大学","nature_name":"公办","province_name":"北京","ruanke_level":"A+","ruanke_rank":"3","school_id":31,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"1","xueke_rank_score":"A+"},{"admissions":"1","central":"2","city_name":"上海市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/132","is_top":2,"level_name":"普通本科","name":"复旦大学","nature_name":"公办","province_name":"上海","ruanke_level":"A+","ruanke_rank":"4","school_id":132,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"3","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"武汉市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/42","is_top":2,"level_name":"普通本科","name":"武汉大学","nature_name":"公办","province_name":"湖北","ruanke_level":"A+","ruanke_rank":"5","school_id":42,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"13","xueke_rank_score":"A-"},{"admissions":"1","central":"2","city_name":"成都市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/99","is_top":2,"level_name":"普通本科","name":"四川大学","nature_name":"公办","province_name":"四川","ruanke_level":"A+","ruanke_rank":"6","school_id":99,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"8","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"上海市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/131","is_top":2,"level_name":"普通本科","name":"华东师范大学","nature_name":"公办","province_name":"上海","ruanke_level":"A+","ruanke_rank":"7","school_id":131,"tag_name":"教育部直属","type_name":"师范类","xueke_rank":"4","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"杭州市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/114","is_top":2,"level_name":"普通本科","name":"浙江大学","nature_name":"公办","province_name":"浙江","ruanke_level":"A+","ruanke_rank":"8","sch
ool_id":114,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"6","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"北京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/46","is_top":2,"level_name":"普通本科","name":"中国人民大学","nature_name":"公办","province_name":"北京","ruanke_level":"A+","ruanke_rank":"9","school_id":46,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"9","xueke_rank_score":"A-"}],"numFound":626},"location":"","encrydata":""} 本帖最后由 allen-xy 于 2024-2-21 10:46 编辑
wingplay posted on 2024-2-21 10:24
Thanks a lot! For the url_1 = f"...." part of your code, what is the original page address? I changed it to page=1 and what opened is this ...
That url is the link the page uses to request each page of actual data (see the attached screenshot). It is not the initial front-end url (the one you type into the browser's address bar) but the url the site requests in the background to fetch the data. What comes back is JSON, i.e. a dict, and you just take the fields you want from that dict. This data is not meant to be displayed in the browser directly; it only appears on the page at the initial url after the browser has finished rendering it. You have to go in and work out which url the page actually uses to get its data.
If you want page 2 of the data, set page to 2; for page 3 write 3, and so on.
The page numbers can be driven by a for loop, as in the sketch below.
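If you want every page rather than a fixed five, one hedged option is to read the total count out of the response and derive the number of pages (the 626 comes from the numFound field in the JSON you pasted; the page size of 10 matches the size=10 request parameter):

import math

num_found = 626   # the "numFound" value in the JSON response above
page_size = 10    # matches the size=10 request parameter
total_pages = math.ceil(num_found / page_size)   # 63 pages in total

print(total_pages)
# In the crawler above you would then loop: for i in range(1, total_pages + 1): ...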
allen-xy posted on 2024-2-21 10:37
That url is the link the page uses to request each page of actual data (see the attached screenshot); it is not the initial front-end url (the one you ...
One more question: how was the signsafe computation worked out?
wingplay posted on 2024-2-22 13:05
One more question: how was the signsafe computation worked out?
Look at the payload of the url requests and find what changes between requests, then search for that value in the browser's captured traffic; it will most likely sit in the JavaScript code, and from there you trace backwards until you find where it is generated.
It is impossible to explain it step by step here; there is far too much to cover.
This process is reverse engineering: reading the site's design logic and reproducing it in Python.
You need some HTML, JavaScript and other front-end knowledge, and you work the result out one piece at a time.
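Once you think you have found the algorithm, a small way to sanity-check a reimplementation (a sketch that assumes, as in the code above, that signsafe is HMAC-SHA1 of the URL-decoded request string keyed with the fixed SIGN value; the captured values are hypothetical placeholders to be copied from the browser's network panel):

import base64
from hashlib import sha1
from hmac import new as hmac_new
from urllib.parse import unquote

def compute_signsafe(request_str, key="D23ABC@#56"):
    # Recompute signsafe the same way as v(t) in the crawler above.
    digest = hmac_new(key.encode(), unquote(request_str).encode(), sha1).digest()
    return base64.b64encode(digest).decode()

captured_str = "..."    # hypothetical: the request string copied from the browser's network panel
captured_sign = "..."   # hypothetical: the signsafe value seen in that same request
print(compute_signsafe(captured_str) == captured_sign)   # True means the reimplementation matches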