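wingplay posted on 2024-2-17 12:34:19

Happy New Year! Beginner asking for help: how do I scrape school information from the "掌上高考" website?

Happy New Year, everyone, and best wishes for a prosperous year.
I'm new to Python and need some help. I want to scrape every school that offers the "汉语言文学" (Chinese Language and Literature) major from the "掌上高考" website. I tried soup.find_all("div",class_="school-tab_schoolName__uLwTK"), but it returns an empty result. Inspecting the page elements, I found many nested layers of div classes, so a plain get does not return the school names inside them. I read online that lxml or CSS selectors are needed, so I found some similar lxml tree code and adapted part of it, but after a long run the result is still empty. The code is below. Any help from fellow forum members would be much appreciated.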
import requests
from bs4 import BeautifulSoup
from lxml import etree


url = "link to the 掌上高考 page listing every school that offers the 汉语言文学 major"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400'}
response = requests.get(url, headers=headers)
html = response.content.decode('utf-8')
#print(html)
soup = BeautifulSoup(html, "html.parser")
#print(soup)

# etree.HTML() parses an HTML string into an element tree.
_element = etree.HTML(html)
print(_element.text)

# xpath() returns a list of text nodes; join them into one string.
text = _element.xpath('//div//text()')
ent_text = "".join(text)
ent_text = ent_text.replace(" ", "")
print(f"Final result: {ent_text}")

'''target_div = soup.select("div#target-div")
'''
'''
data = []
for div in soup.find_all("div",class_="school-tab_schoolName__uLwTK"):
    print(div)
    content = div.text.strip()
    data.append(content)
'''

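FishC posted on 2024-2-17 12:34:37

Happy New Year! Let me help you with this.

A few things in your code can be changed. First, the HTML can be read from `response.text` instead of decoding `response.content` by hand with `.decode('utf-8')`. Second, when parsing with lxml, that same string can be passed straight to `etree.HTML(response.text)`. You can also try a CSS selector to find the target elements.

Here is the revised code: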

import requests
from bs4 import BeautifulSoup

url = "link to the 掌上高考 page listing every school that offers the 汉语言文学 major"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400'}
response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, "html.parser")

# Use a CSS selector to find the target elements
data = []
for div in soup.select("div.school-tab_schoolName__uLwTK"):
    content = div.text.strip()
    data.append(content)

print("Final result:")
for item in data:
    print(item)

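I hope this helps you scrape the school information from the target page. Good luck!

The reply above comes from the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up.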

wingplay posted on 2024-2-17 13:28:14

I ran the code above and the result is still empty... I changed it again and still get nothing.

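FishC posted on 2024-2-17 13:28:28

wingplay posted on 2024-02-17 13:28
I ran the code above and the result is still empty... I changed it again and still no ...

I'm sorry you are still running into problems. There are several possible reasons why no data comes back. First, make sure the page link you are using is correct and that your request headers are valid. Second, use `soup.prettify()` to view the HTML as parsed by BeautifulSoup and confirm that the target elements are actually there. Finally, check that the CSS selector for the target elements is correct.

Also, from the code and description you gave, I noticed that with the lxml parser you used `.text` to turn `_element` into text, which may keep you from getting the content of the target elements. You can try `.xpath('string()')` instead of `.text` to get the text content. Here is the revised code: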

import requests
from bs4 import BeautifulSoup
from lxml import etree

url = "link to the 掌上高考 page listing every school that offers the 汉语言文学 major"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3880.400 QQBrowser/10.8.4554.400'}

response = requests.get(url, headers=headers)
html = response.text

soup = BeautifulSoup(html, "html.parser")

# Use a CSS selector to find the target elements
data = []
for div in soup.select("div.school-tab_schoolName__uLwTK"):
    content = div.text.strip()
    data.append(content)

print("Final result:")
for item in data:
    print(item)

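If you still cannot get any data, please share more details about the structure of the target page and the content you want to scrape, so that I can help you better. Good luck!

The reply above comes from the FishC AI assistant. If it did not answer your question correctly, please ask a follow-up.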

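阿奇_o posted on 2024-2-17 19:08:32

The page is loaded dynamically, you see? If that means nothing to you, no matter: as long as speed is not a concern, just use selenium.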
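
To illustrate what "loaded dynamically" means here, the sketch below parses two made-up HTML strings with the standard library only. `STATIC_SHELL` stands in for what `requests` typically receives from a JavaScript-rendered site, and `RENDERED_DOM` for what the browser shows after the scripts have run. Both strings and the `ClassCollector` helper are assumptions for illustration, not the real page source; only the class name comes from the question.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what requests.get() receives: an empty mount
# point plus a script tag. The school list is filled in later by JavaScript.
STATIC_SHELL = """
<html><body>
  <div id="root"></div>
  <script src="/static/app.js"></script>
</body></html>
"""

# Hypothetical stand-in for the DOM after the browser has run the script.
RENDERED_DOM = """
<html><body>
  <div id="root">
    <div class="school-tab_schoolName__uLwTK">北京大学</div>
    <div class="school-tab_schoolName__uLwTK">南京大学</div>
  </div>
</body></html>
"""


class ClassCollector(HTMLParser):
    """Collect the text of every <div> carrying the given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # > 0 while inside a matching div
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.wanted_class in classes:
            self.depth += 1
            if self.depth == 1:
                self.found.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.found[-1] += data.strip()


def find_names(html):
    parser = ClassCollector("school-tab_schoolName__uLwTK")
    parser.feed(html)
    return parser.found


print(find_names(STATIC_SHELL))  # nothing to find in the shell
print(find_names(RENDERED_DOM))  # names exist only after rendering
```

The same search succeeds or fails depending on which HTML it is given, which is why changing the parser or the selector alone does not help.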

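wingplay posted on 2024-2-18 21:34:43

阿奇_o posted on 2024-2-17 19:08
The page is loaded dynamically, you see? If that means nothing to you, no matter: as long as speed is not a concern, just use selenium.

I'm a complete beginner and really don't understand. Speed is not a concern. Is there a similar selenium code snippet I could use as a reference?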
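
A minimal selenium sketch for the request above might look like the following. It assumes Selenium 4 and a local Chrome installation, and it has not been run against the real site. The names `build_selector` and `fetch_school_names` and the `timeout` value are made up for illustration; the class prefix is taken from the question, and the hashed suffix (`__uLwTK`) is left out on the assumption that it may change between site builds.

```python
def build_selector(class_prefix):
    """CSS selector matching any div whose class attribute contains the prefix."""
    return f'div[class*="{class_prefix}"]'


def fetch_school_names(url, class_prefix="school-tab_schoolName", timeout=15):
    # Imported lazily so build_selector() can be used without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumes Chrome is installed locally
    try:
        driver.get(url)
        # Wait until JavaScript has inserted at least one matching element.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located(
                (By.CSS_SELECTOR, build_selector(class_prefix))
            )
        )
        return [el.text.strip() for el in elements]
    finally:
        driver.quit()
```

Calling `fetch_school_names(url)` with the address from the browser's address bar would return the names visible on the first page only; further pages would need the pager to be clicked or the URL to be changed.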

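allen-xy posted on 2024-2-19 18:19:27

Last edited by allen-xy on 2024-2-20 10:44

This site requires reverse-engineering the signsafe parameter carried in each request; the other parameters can stay fixed.
When the page (the official site URL) first loads, the data does not come with it.
The data is fetched dynamically through an ajax request (to a different URL), and the server performs some validation when it is fetched.
Work out how the key parameter is computed, port that logic to Python, and the rest is ordinary Python scraping code.

The key point is working out how signsafe is obtained; everything else is routine.

bs only works when the data is present in the page source. Here the data is loaded dynamically, so it does not apply.
Testing showed:
1. A request without the signsafe field still returns data normally, so the site may not validate this value strictly. Even so, it is better to send the same request parameters a normal browser does.
2. The signsafe values the site generates are not of fixed length, so this may need further investigation.

That is the rough idea; the original poster can dig deeper from here.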
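
On point 2, one possible explanation (not confirmed against the site) is URL encoding. The script in this post computes signsafe as a Base64-encoded HMAC-SHA1, and that value always has the same length before it is placed in a URL; percent-encoding the `+`, `/` and `=` characters afterwards changes the length by a different amount each time. The sketch below shows this with a made-up key and made-up messages; `sign` is a hypothetical helper that mirrors the same scheme.

```python
import base64
import hmac
from hashlib import sha1
from urllib.parse import quote


def sign(key, message):
    """Base64-encoded HMAC-SHA1 of message."""
    digest = hmac.new(key.encode(), message.encode(), sha1).digest()
    return base64.b64encode(digest).decode()


# Hypothetical inputs, used only to show the lengths involved.
samples = [sign("demo-key", f"example.invalid/api?page={i}") for i in range(1, 21)]

raw_lengths = {len(s) for s in samples}
quoted_lengths = {len(quote(s, safe="")) for s in samples}

print(raw_lengths)     # SHA-1 digests are 20 bytes, so Base64 always gives 28 chars
print(quoted_lengths)  # depends on how many '+', '/' and '=' each value contains
```

If the values seen in the browser's network panel are percent-encoded, decoding them with `urllib.parse.unquote` before comparing lengths would be a quick way to check this guess.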

Disclaimer: this script is for learning purposes only and must not be used for any illegal purpose.

import base64
from hashlib import sha1
from hmac import new as hmac_new
from urllib.parse import unquote
import requests
import time


# How signsafe is computed: HMAC-SHA1 of the URL-decoded string, then Base64.
def v(t):
    n = t['SIGN']
    t = t['str']
    t = unquote(t)
    n = hmac_new(n.encode(), t.encode(), sha1).digest()
    n = base64.b64encode(n).decode()
    return n


session = requests.session()
session.headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36"
}

for i in range(1, 6):
    # Fetch the first 5 pages of data
    url_1 = f"https://api.zjzw.cn/web/api/?is_single=2&local_province_id=12&page={i}&province_id=&request_type=1&size=10&special_id=40&top_school_id=2498&type=&uri=apidata/api/gk/special/school"

    # Compute signsafe; the browser attaches one to every request.
    t = {
        'SIGN': "D23ABC@#56",
        'str': url_1
    }
    signsafe = v(t)
    print("==========")

    # Prepare the request URL and the parameters for the POST request.
    url_full = f"{url_1}&signsafe={signsafe}"
    url_params = {
        "is_single": "2",
        "local_province_id": "12",
        "page": i,
        "province_id": "",
        "request_type": "1",
        "signsafe": signsafe,
        "size": "10",
        "special_id": "40",
        "top_school_id": "2498",
        "type": "",
        "uri": "apidata/api/gk/special/school",
    }

    # Send the request and get the data.
    resp = session.post(url_full, params=url_params)
    # The result is JSON
    res = resp.json()

    for data in res['data']['item']:
        # Pick whichever fields you need here
        print(f"School: {data['name']}\tCity: {data['city_name']}")

    # Be polite to the server: pause between requests.
    time.sleep(2)


The output is as follows:
==========
School: 成都锦城学院	City: 成都市
School: 北京师范大学	City: 北京市
School: 南京大学	City: 南京市
School: 北京大学	City: 北京市
School: 复旦大学	City: 上海市
School: 武汉大学	City: 武汉市
School: 四川大学	City: 成都市
School: 华东师范大学	City: 上海市
School: 浙江大学	City: 杭州市
School: 中国人民大学	City: 北京市
==========
School: 陕西师范大学	City: 西安市
School: 中山大学	City: 广州市
School: 山东大学	City: 济南市
School: 南开大学	City: 天津市
School: 暨南大学	City: 广州市
School: 南京师范大学	City: 南京市
School: 吉林大学	City: 长春市
School: 苏州大学	City: 苏州市
School: 华中师范大学	City: 武汉市
School: 首都师范大学	City: 北京市
==========
School: 福建师范大学	City: 福州市
School: 上海师范大学	City: 上海市
School: 北京语言大学	City: 北京市
School: 浙江师范大学	City: 金华市
School: 西南大学	City: 重庆市
School: 湖南师范大学	City: 长沙市
School: 上海大学	City: 上海市
School: 东北师范大学	City: 长春市
School: 西北大学	City: 西安市
School: 华南师范大学	City: 广州市
==========
School: 中央民族大学	City: 北京市
School: 上海交通大学	City: 上海市
School: 江苏师范大学	City: 徐州市
School: 扬州大学	City: 扬州市
School: 山东师范大学	City: 济南市
School: 河南大学	City: 开封市
School: 四川师范大学	City: 成都市
School: 兰州大学	City: 兰州市
School: 云南大学	City: 昆明市
School: 华中科技大学	City: 武汉市
==========
School: 河北大学	City: 保定市
School: 北京外国语大学	City: 北京市
School: 郑州大学	City: 郑州市
School: 广西师范大学	City: 桂林市
School: 黑龙江大学	City: 哈尔滨市
School: 安徽大学	City: 合肥市
School: 厦门大学	City: 厦门市
School: 中国传媒大学	City: 北京市
School: 内蒙古大学	City: 呼和浩特市
School: 南昌大学	City: 南昌市

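@Admins, I'm not sure whether replying to the thread like this is a problem. If it breaks any rules, please contact me to correct it, or could a moderator kindly correct it.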

wingplay posted on 2024-2-21 10:23:08

Thank you very much! May I ask what the original page address is for the url_1 = f"...." part of your code? After changing it to page=1 and opening it, this is what I get:
{"code":"0000","message":"成功---success","data":{"item":[{"admissions":"2","central":"2","city_name":"成都市","department":"2","doublehigh":"0","dual_class":"38003","dual_class_name":"","f211":2,"f985":2,"id":"gkspecialschool40/2498","is_top":1,"level_name":"普通本科","name":"成都锦城学院","nature_name":"民办","province_name":"四川","ruanke_level":"","ruanke_rank":"","school_id":2498,"tag_name":"","type_name":"综合类","xueke_rank":9999,"xueke_rank_score":""},{"admissions":"1","central":"2","city_name":"北京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/52","is_top":2,"level_name":"普通本科","name":"北京师范大学","nature_name":"公办","province_name":"北京","ruanke_level":"A+","ruanke_rank":"1","school_id":52,"tag_name":"教育部直属","type_name":"师范类","xueke_rank":"2","xueke_rank_score":"A+"},{"admissions":"1","central":"2","city_name":"南京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/111","is_top":2,"level_name":"普通本科","name":"南京大学","nature_name":"公办","province_name":"江苏","ruanke_level":"A+","ruanke_rank":"2","school_id":111,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"5","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"北京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/31","is_top":2,"level_name":"普通本科","name":"北京大学","nature_name":"公办","province_name":"北京","ruanke_level":"A+","ruanke_rank":"3","school_id":31,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"1","xueke_rank_score":"A+"},{"admissions":"1","central":"2","city_name":"上海市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/132","is_top":2,"level_name":"普通本科","name":"复旦大学","nature_name":"公办","province_name":"上海","ruanke_level":"A+","ruanke_rank":"4","school_id":132,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"3","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"武汉市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/42","is_top":2,"level_name":"普通本科","name":"武汉大学","nature_name":"公办","province_name":"湖北","ruanke_level":"A+","ruanke_rank":"5","school_id":42,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"13","xueke_rank_score":"A-"},{"admissions":"1","central":"2","city_name":"成都市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/99","is_top":2,"level_name":"普通本科","name":"四川大学","nature_name":"公办","province_name":"四川","ruanke_level":"A+","ruanke_rank":"6","school_id":99,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"8","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"上海市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/131","is_top":2,"level_name":"普通本科","name":"华东师范大学","nature_name":"公办","province_name":"上海","ruanke_level":"A+","ruanke_rank":"7","school_id":131,"tag_name":"教育部直属","type_name":"师范类","xueke_rank":"4","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"杭州市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/114","is_top":2,"level_name":"普通本科","name":"浙江大学","nature_name":"公办","province_name":"浙江","ruanke_level":"A+","ruanke_rank":"8","school_id":114,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"6","xueke_rank_score":"A"},{"admissions":"1","central":"2","city_name":"北京市","department":"1","doublehigh":"0","dual_class":"38000","dual_class_name":"双一流","f211":1,"f985":1,"id":"gkspecialschool40/46","is_top":2,"level_name":"普通本科","name":"中国人民大学","nature_name":"公办","province_name":"北京","ruanke_level":"A+","ruanke_rank":"9","school_id":46,"tag_name":"教育部直属","type_name":"综合类","xueke_rank":"9","xueke_rank_score":"A-"}],"numFound":626},"location":"","encrydata":""}

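allen-xy posted on 2024-2-21 10:37:40

Last edited by allen-xy on 2024-2-21 10:46

wingplay posted on 2024-2-21 10:24
Thank you very much! May I ask what the original page address is for the url_1 = f"...." part of your code? After changing it to page=1 and opening it, this is what I get

This URL is the request link for each page of data shown on the site (see the attached screenshot). It is not the initial URL shown up front (the one you type into the browser's address bar) but the link the site uses when it issues requests automatically in the background. It returns JSON data, which is just a dictionary, so take whatever you need from it. This data is not meant to be displayed directly in the browser; it appears on the initial URL's page only after the browser has rendered it. You have to open the page and work out which URL it actually uses to fetch the data.
For page 2, set page to 2; for page 3, set it to 3; and so on.
The page numbers can be generated with a for loop.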
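
As a rough sketch of "take what you need from the dictionary", the snippet below parses a trimmed copy of the response pasted in this thread and works out how many pages the for loop would need. It assumes that `numFound` is the total number of matching schools, which the thread does not confirm; `parse_page` and `PAGE_SIZE` are names made up for illustration.

```python
import json
import math

# Two items trimmed from the response pasted in this thread.
SAMPLE = '''
{"code": "0000", "message": "成功---success",
 "data": {"item": [
     {"name": "成都锦城学院", "city_name": "成都市", "school_id": 2498},
     {"name": "北京师范大学", "city_name": "北京市", "school_id": 52}
 ], "numFound": 626},
 "location": "", "encrydata": ""}
'''

PAGE_SIZE = 10  # matches size=10 in the request URL


def parse_page(text):
    """Return (list of (name, city) pairs, total number of pages)."""
    payload = json.loads(text)
    data = payload["data"]
    rows = [(item["name"], item["city_name"]) for item in data["item"]]
    # Assumption: numFound counts all results across every page.
    pages = math.ceil(data["numFound"] / PAGE_SIZE)
    return rows, pages


rows, pages = parse_page(SAMPLE)
print(rows)
print(pages)  # 626 results at 10 per page -> 63 pages
```

Under that assumption, looping `page` from 1 to `pages` would cover every school rather than only the first five pages.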

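wingplay posted on 2024-2-22 13:05:21

allen-xy posted on 2024-2-21 10:37
This URL is the request link for each page of data shown on the site (see the attached screenshot). It is not the initial URL shown up front (you ...

One more question: how did you work out the signsafe calculation?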

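allen-xy posted on 2024-2-22 15:01:59

wingplay posted on 2024-2-22 13:05
One more question: how did you work out the signsafe calculation?

Look at the payload of the request URL and find what differs from one request to the next. Then search for that item in the browser's captured traffic; it will most likely be in the JS code. From there, trace backwards through the JS until you find where the value is generated.
It is not possible to explain everything step by step here; there is too much to cover.
This process is reverse engineering: reading the site's design logic and reproducing it in Python code.
It requires knowing front-end technologies such as HTML and JavaScript, and working the result out step by step.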
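
The first step described above, finding what differs between requests, can be sketched with the standard library. The two URLs below are shortened, and their signsafe values are placeholders rather than real captured values; `changed_params` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlsplit


def changed_params(url_a, url_b):
    """Names of query parameters whose values differ between two URLs."""
    a = parse_qs(urlsplit(url_a).query, keep_blank_values=True)
    b = parse_qs(urlsplit(url_b).query, keep_blank_values=True)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Two captured requests; the signsafe values are placeholders, not real ones.
first = "https://api.zjzw.cn/web/api/?page=1&size=10&special_id=40&signsafe=AAAA"
second = "https://api.zjzw.cn/web/api/?page=2&size=10&special_id=40&signsafe=BBBB"

print(changed_params(first, second))  # ['page', 'signsafe']
```

Any parameter that changes without an obvious cause, here signsafe, is the one to search for in the site's JS.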