| 
 | 
 
60鱼币 
 本帖最后由 python羊 于 2021-5-21 15:44 编辑  
 
地址:https://www.gia.edu/sites/Satell ... p;cid=1495275503754 
 
 
 
 
提取类容在:4432行的全部类容。如下图: 
 
 
话说我只想要这个数据,为什么源代码这么多。。。。 
或许 有更快速的方法,请指教。感谢 
 
 
我的代码: 
—————————————— 
import requests 
 
import re 
 
 
s = requests.Session() 
 
headers={ 
    'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36', 
} 
 
url_end = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754' 
 
r_end =s.get(url_end,headers=headers) 
 
r_end_str = r_end.text 
 
content_list=re.findall('<span style="display:none;" name="xmlcontent" id="xmlcontent">"(.*?)"</span>',r_end_str) 
 
print(content_list) 
感觉 bs4 快点,re 不怎么会,span 里面很多节点不知道怎么弄 
 
re(标签没去除)参考代码: 
- import requests
 
  
- import re
 
  
 
- s = requests.Session()
 
  
- headers={
 
 -     'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36',
 
 - }
 
  
- url_end = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754'
 
  
- r_end =s.get(url_end,headers=headers)
 
  
- r_end_str = r_end.text
 
  
- content_list = re.findall('<REPORT_CHECK_RESPONSE>(.+)</REPORT_CHECK_RESPONSE>',r_end_str)[0]
 
  
- print(content_list)
 
  复制代码 
 
bs4 参考代码: 
- import requests
 
 - from bs4 import BeautifulSoup
 
  
 
- s = requests.Session()
 
  
- headers={
 
 -     'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Mobile Safari/537.36',
 
 - }
 
  
- url_end = 'https://www.gia.edu/sites/Satellite?reportno=6342219172&c=Page&childpagename=GIA%2FPage%2FReportCheck&pagename=GIA%2FWrapper&cid=1495275503754'
 
  
- r_end =s.get(url_end,headers=headers)
 
  
- r_end_str = r_end.text
 
 - soup = BeautifulSoup(r_end_str,'lxml')
 
  
- content_list= soup.find_all("span",id="xmlcontent")[0].text
 
  
- print(content_list)
 
  复制代码 
 
 
 |   
 
 
 
- 
 
 
 
 
最佳答案
查看完整内容 
感觉 bs4 快点,re 不怎么会,span 里面很多节点不知道怎么弄
re(标签没去除)参考代码:
bs4 参考代码: 
 
 
 
 
 
 
 |