Some questions about writing a web scraper, Python Discussion, Programming Languages, FishC Forum (鱼C论坛)

snowJR posted on 2021-5-15 09:09:43

Some questions about writing a web scraper

The URL is:
https://ned.ipac.caltech.edu/byname?objname=PKS%200002-478&hconst=67.8&omegam=0.308&omegav=0.692&wmap=4&corr_z=1

How do I scrape the data circled in red on this page?

南归 posted on 2021-5-15 10:18:07

https://ned.ipac.caltech.edu/ffs/sticky/CmdSrv

南归 posted on 2021-5-15 10:18:51

Open the browser developer tools (F12) and look for the request to the link I posted above.

befal posted on 2021-5-15 10:29:26

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome()
driver.get('https://ned.ipac.caltech.edu/byname?objname=PKS%200002-478&hconst=67.8&omegam=0.308&omegav=0.692&wmap=4&corr_z=1')
time.sleep(5)  # crude wait for the page's scripts to finish rendering

# Switch to the "Photometry & SED (47)" tab
driver.find_element(By.ID, 'ui-id-10').click()
time.sleep(5)

# This tab contains two tables, so take the first one
tab1 = driver.find_elements(By.XPATH, '//div[@class="fixedDataTableLayout_rowsContainer"]')[0]
# Every table cell has the class public_fixedDataTableCell_cellContent
cells = tab1.find_elements(By.CLASS_NAME, 'public_fixedDataTableCell_cellContent')
# Collect the text of every cell
txt = [cell.text for cell in cells]
print(txt)

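snowJR posted on 2021-5-15 17:38:30

南归 posted on 2021-5-15 10:18
Open the browser developer tools (F12) and look for the request to the link I posted above.

I don't quite understand.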

南归 posted on 2021-5-15 18:56:30

Spend some time working out how to use the F12 developer tools yourself....

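snowJR posted on 2021-5-15 19:08:21

南归 posted on 2021-5-15 18:56
Spend some time working out how to use the F12 developer tools yourself....

Is this one of those pages that loads its data via AJAX?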

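南归 posted on 2021-5-15 19:17:19

https://www.hualigs.cn/image/609fad3a06eca.jpg

Open the page and wait for it to finish loading. Clear the captured network requests, then click "Photometry & SED (47)" and search the new requests for "Gamma-Ray". You should end up with the view shown in the screenshot.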

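YunGuo posted on 2021-5-15 21:06:14

The poster above has already found the data endpoint for you. Send a POST request to that endpoint with the right parameters and you will get the data directly.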
import requests

url = 'https://ned.ipac.caltech.edu/ffs/sticky/CmdSrv'
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}
data = {
'request': '{"startIdx":0,"pageSize":1000,"ffSessionId":"FF-Session-1621083370564","filters":"","source":"http://ned.ipac.caltech.edu/cgi-bin/objsearch?extend=no&out_csys=Equatorial&out_equinox=J2000.0&obj_sort=RA+or+Longitude&of=xml_qlphot&zv_breaker=30000.0&list_limit=5&img_stamp=YES&objname=PKS+0002-478&hconst=67.8&omegam=0.308&omegav=0.692&wmap=4&corr_z=1&objid=76355","alt_source":"http://ned.ipac.caltech.edu/cgi-bin/objsearch?extend=no&out_csys=Equatorial&out_equinox=J2000.0&obj_sort=RA+or+Longitude&of=xml_qlphot&zv_breaker=30000.0&list_limit=5&img_stamp=YES&objname=PKS+0002-478&hconst=67.8&omegam=0.308&omegav=0.692&wmap=4&corr_z=1&objid=76355","META_INFO":{"title":"qlphot","tbl_id":"tbl_id-c50b9-6","col.Refcode.PrefWidth":"20","col.Spectral Region.PrefWidth":"14","col.Band.PrefWidth":"17","col.Apparent Mag or Flux.PrefWidth":"16","col.Reference code.PrefWidth":"12","selectInfo":"false--0"},"tbl_id":"tbl_id-c50b9-6","id":"IpacTableFromSource"}',
'cmd': 'tableSearch'
}
res = requests.post(url, headers=headers, data=data)
print(res.json())

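snowJR posted on 2021-5-16 10:12:46

YunGuo posted on 2021-5-15 21:06
The poster above has already found the data endpoint for you. Send a POST request to that endpoint with the right parameters and you will get the data directly.

When I change the search target, does the content of data have to be obtained automatically?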

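YunGuo posted on 2021-5-16 19:19:44

snowJR posted on 2021-5-16 10:12
When I change the search target, does the content of data have to be obtained automatically?

You only need to change the query parameters inside data. Give me an example of something else you want to search for and I will work out which parameters change.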
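
Judging from the hand-written request string in the post above, the parts that look specific to one object are objname and objid inside the "source" URL. A minimal sketch of building the same form fields with json.dumps instead of editing the string by hand follows. The helper name build_form is hypothetical, treating ffSessionId as a millisecond timestamp is an assumption, and the column-width hints from META_INFO are left out on the assumption that they only affect display. It has not been checked against the live server.

```python
import json
import time

# Fixed part of the "source" URL, copied from the request shown in the thread.
BASE = (
    "http://ned.ipac.caltech.edu/cgi-bin/objsearch?extend=no&out_csys=Equatorial"
    "&out_equinox=J2000.0&obj_sort=RA+or+Longitude&of=xml_qlphot&zv_breaker=30000.0"
    "&list_limit=5&img_stamp=YES"
)


def build_form(objname, objid, session_ms=None):
    """Return the form fields for the CmdSrv POST (hypothetical helper).

    objname must already be URL-encoded; objid is the numeric ID as a string.
    """
    if session_ms is None:
        # Assumption: the session ID is just the current time in milliseconds.
        session_ms = int(time.time() * 1000)
    source = (
        f"{BASE}&objname={objname}&hconst=67.8&omegam=0.308"
        f"&omegav=0.692&wmap=4&corr_z=1&objid={objid}"
    )
    request = {
        "startIdx": 0,
        "pageSize": 1000,
        "ffSessionId": f"FF-Session-{session_ms}",
        "filters": "",
        "source": source,
        "alt_source": source,  # identical to "source" in the captured request
        "META_INFO": {
            "title": "qlphot",
            "tbl_id": "tbl_id-c50b9-6",
            "selectInfo": "false--0",
        },
        "tbl_id": "tbl_id-c50b9-6",
        "id": "IpacTableFromSource",
    }
    return {"request": json.dumps(request), "cmd": "tableSearch"}
```

The returned dict could then be passed as data= to requests.post, in the same way as the data dict in the post above.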

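snowJR posted on 2021-5-17 07:29:51

YunGuo posted on 2021-5-16 19:19
You only need to change the query parameters inside data. Give me an example of something else you want to search for and I will work out which parameters change.

For example, I now want to look up the information for 0106+013.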
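
A name such as 0106+013 needs care before it goes into a URL: in a query string a literal + is commonly decoded as a space, so the name should be percent-encoded first. A small sketch using the standard library follows; the helper name encode_objname is hypothetical, and whether the NED server treats a raw + this way is an assumption.

```python
from urllib.parse import quote


def encode_objname(name):
    """Percent-encode an object name for use as the objname query parameter."""
    # quote() leaves letters, digits and "-" alone, and encodes space and "+".
    return quote(name)


print(encode_objname("PKS 0002-478"))  # PKS%200002-478
print(encode_objname("0106+013"))      # 0106%2B013
```

The first result matches the objname value in the URL from the opening post.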

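snowJR posted on 2021-5-17 07:30:43

YunGuo posted on 2021-5-16 19:19
You only need to change the query parameters inside data. Give me an example of something else you want to search for and I will work out which parameters change.

The original search URL is this one:

https://ned.ipac.caltech.edu/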

YunGuo posted on 2021-5-17 21:19:01

snowJR posted on 2021-5-17 07:30
The original search URL is this one:

https://ned.ipac.caltech.edu/

import requests
import re
import time
from urllib import parse

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
}

def parser(datas):
    # Print each row of the returned table
    for data in datas:
        print(data)

def get_data(objid, keyword):
    url = 'https://ned.ipac.caltech.edu/ffs/sticky/CmdSrv'
    # The session ID is built from the current time in milliseconds
    ff = str(int(time.time() * 1000))
    # "source" and "alt_source" are the same URL; only objname and objid change per object
    source = 'http://ned.ipac.caltech.edu/cgi-bin/objsearch?extend=no&out_csys=Equatorial&out_equinox=J2000.0&obj_sort=RA+or+Longitude&of=xml_qlphot&zv_breaker=30000.0&list_limit=5&img_stamp=YES&objname=' + keyword + '&hconst=67.8&omegam=0.308&omegav=0.692&wmap=4&corr_z=1&objid=' + objid
    data = {
        'request': '{"startIdx":0,"pageSize":1000,"ffSessionId":"FF-Session-' + ff + '","filters":"","source":"' + source + '","alt_source":"' + source + '","META_INFO":{"title":"qlphot","tbl_id":"tbl_id-c50b9-6","col.Refcode.PrefWidth":"20","col.Spectral Region.PrefWidth":"14","col.Band.PrefWidth":"17","col.Apparent Mag or Flux.PrefWidth":"16","col.Reference code.PrefWidth":"12","selectInfo":"false--0"},"tbl_id":"tbl_id-c50b9-6","id":"IpacTableFromSource"}',
        'cmd': 'tableSearch'
    }
    res = requests.post(url, headers=headers, data=data)
    return res.json()['tableData']['data']

def get_objid(keyword):
    url = f'https://ned.ipac.caltech.edu/byname?objname={keyword}&hconst=67.8&omegam=0.308&omegav=0.692&wmap=4&corr_z=1'
    res = requests.get(url, headers=headers)
    # re.findall returns a list; take the first match as the object ID
    objid = re.findall('objid=(.*?)"', res.text)[0]
    datas = get_data(objid, keyword)
    parser(datas)

if __name__ == '__main__':
    word = input('Enter the object name: ')
    # Percent-encode the name so it is safe to put in a URL
    key_word = parse.quote(word)
    get_objid(key_word)
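
The script above only prints the rows. If the goal is to keep them, one option is to write them out as CSV. The sketch below assumes that the value returned by get_data is a list of rows, each a list of cell values; the thread does not show the response, so that shape is an assumption. The helper name rows_to_csv and the sample rows are hypothetical placeholders, not real NED output.

```python
import csv
import io


def rows_to_csv(rows, header=None):
    """Serialise a list of rows (each a list of cell values) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    if header:
        writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


# Placeholder rows standing in for the output of get_data(); not real data.
sample_rows = [["band A", "1.0"], ["band B, wide", "2.0"]]
print(rows_to_csv(sample_rows, header=["Band", "Value"]))
```

The csv module quotes any cell that itself contains a comma, which is why it is used here rather than joining the cells with ",".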

snowJR posted on 2021-5-18 08:34:56

YunGuo posted on 2021-5-17 21:19

Thank you so much!!
