Scraping Free Xiangsheng (Crosstalk) Audio from Ximalaya with Python, V2

qiuyouzhi · posted 2020-5-2 11:53:07


Last edited by qiuyouzhi on 2020-5-2 11:54


I meant to add this to my previous thread, but it has sunk down the board, so I'm starting a new one here.

Requirements

1. Print every xiangsheng album on the first page and let the user pick one.

2. Download all the audio tracks in that album.

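Approach

Target URL: https://www.ximalaya.com/xiangsheng/xiangsheng/mr132t2722/

OK, first scout the page.

Open Inspect Element and look through the markup:

Hmm, there's no useful data there, just an image we can't use.

What now? Capture the network traffic!

Oh, look at that: a request whose name starts with audio!

Open it:

See the src field? Go straight to the URL it holds:

It can be downloaded directly!

So, look at the URL of the request that returned the download URL:

https://www.ximalaya.com/revision/play/v1/audio?id=276429286&ptype=1

It needs an id (276429286 here). We don't have a good way to obtain it yet, so set it aside for now.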
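To make the shape of that request concrete, here is a minimal sketch that builds the audio-info URL from a track id. The helper name `build_audio_url` and the constant `AUDIO_API` are made up for illustration; only the endpoint and its `id` and `ptype` parameters come from the captured request above.

```python
from urllib.parse import urlencode

# Endpoint taken from the captured request; the constant name is hypothetical
AUDIO_API = "https://www.ximalaya.com/revision/play/v1/audio"


def build_audio_url(track_id, ptype=1):
    """Return the audio-info URL for a single track id."""
    query = urlencode({"id": track_id, "ptype": ptype})
    return f"{AUDIO_API}?{query}"


print(build_audio_url(276429286))
```

Once the id is known, this is the only part of the URL that changes from track to track.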

First handle part one, scraping the xiangsheng albums on the first page:

See that big run of li tags? What we need is inside them.

Code:

from requests import get
from lxml import etree

def open_url(url):
    # Send a browser-like User-Agent with every request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
    res = get(url, headers=headers)
    return res

def get_xs(res):
    temp = 'https://www.ximalaya.com/xiangsheng'
    html = etree.HTML(res.text)
    name = html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a/span/text()")
    # The scraped links are incomplete, so join each one onto temp
    href = [temp + each for each in html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a[1]/@href")]
    # Number the (name, link) pairs from 1 so the user can pick one
    i = 1
    result = {}
    for k in zip(name, href):
        result[i] = k
        i += 1
    return result

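The numbering step in get_xs can also be written more compactly. The sketch below does the same pairing with `enumerate`; the function name `number_albums` and the sample names and hrefs are invented for illustration and are not real scraped values.

```python
def number_albums(names, hrefs, base="https://www.ximalaya.com/xiangsheng"):
    """Map 1-based menu numbers to (album name, full album URL) pairs."""
    return {
        i: (name, base + href)
        for i, (name, href) in enumerate(zip(names, hrefs), start=1)
    }


# Hypothetical sample data standing in for the two XPath results
albums = number_albums(["Album A", "Album B"], ["/111/", "/222/"])
for number, (name, link) in albums.items():
    print(number, name, link)
```

The resulting dictionary has the same layout as the one get_xs returns, so the rest of the script would not need to change.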

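OK, the names and links are now saved in the dictionary.

Next, scrape the tracks inside an album:

Same as before, everything is in li tags, so we can write: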

def get_Videourl(nm, hf, page):
    # Helper that flattens a list of lists into one stream of items
    def tempfunc(list1):
        for each in list1:
            yield from each
    vdurl = [] # list that will hold the audio URLs
    name = []
    # Album pages are addressed as <album link>p1, <album link>p2, ...
    for i in range(1, page + 1):
        tempurl = hf + f'p{i}'
        res = open_url(tempurl)
        html = etree.HTML(res.text)
        href = html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/@href')
        name.append(html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/span/text()'))

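The inner tempfunc generator exists because each page contributes its own list of track names, so name ends up as a list of lists. As a rough equivalent, the standard library's `itertools.chain.from_iterable` does the same flattening; the function name `flatten` and the sample titles below are placeholders.

```python
from itertools import chain


def flatten(pages):
    """Flatten a list of per-page lists into one flat list."""
    return list(chain.from_iterable(pages))


# Hypothetical track titles from two pages
per_page_names = [["Track 1", "Track 2"], ["Track 3"]]
print(flatten(per_page_names))
```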

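OK, that's about it. But look back at the URL of a track inside the album:

https://www.ximalaya.com/xiangsheng/35105051/290954115

Hey! Aren't those last 9 digits the ID?

The href XPath already returns these URLs, so just pull the ID out of each one: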

def get_Videourl(nm, hf, page):
    # Helper that flattens a list of lists into one stream of items
    def tempfunc(list1):
        for each in list1:
            yield from each
    vdurl = [] # list that will hold the audio URLs
    name = []
    for i in range(1, page + 1):
        tempurl = hf + f'p{i}'
        res = open_url(tempurl)
        html = etree.HTML(res.text)
        href = html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/@href')
        name.append(html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/span/text()'))
        # By observation the id is the last 9 characters of the URL, so slice it out
        ids = [each[-9:] for each in href]
        for id in ids:
            vdurl.append('https://www.ximalaya.com/revision/play/v1/audio?id=%s&ptype=1' % id)
    return vdurl, list(tempfunc(name))

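Slicing the last 9 characters works for the URLs seen here, but it relies on every id being exactly 9 digits long. A slightly more defensive sketch, assuming the id is always the final path segment as in the example URL, splits on the last slash instead. The helper name `track_id_from_href` is hypothetical.

```python
def track_id_from_href(href):
    """Return the last path segment of a track URL, which holds the track id."""
    # Drop any trailing slash first so the final segment is never empty
    return href.rstrip("/").rsplit("/", 1)[-1]


print(track_id_from_href("https://www.ximalaya.com/xiangsheng/35105051/290954115"))
```

For the example URL from the post this gives the same result as the slice, and it keeps working if an id turns out to be shorter or longer.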

OK! All that's left is downloading the data.

With the IDs in hand, downloading is the easy part. The audio URL returns:

{"ret":200,"data":{"trackId":276429286,"canPlay":true,"isPaid":false,"hasBuy":true,"src":"https://aod.cos.tx.xmcdn.com/group77/M09/AE/4F/wKgO1V6DPwKCI8dMANcavfR4wq4296.m4a","albumIsSample":false,"sampleDuration":180,"isBaiduMusic":false,"firstPlayStatus":true}}

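As an illustration of reading that response, the sketch below parses the sample JSON shown above and pulls out the download link. The function name `extract_src` is invented, and treating a non-200 `ret` or a missing `src` as "nothing to download" is an assumption, not documented behaviour of the endpoint.

```python
import json

# The sample response quoted in the post
SAMPLE = '{"ret":200,"data":{"trackId":276429286,"canPlay":true,"isPaid":false,"hasBuy":true,"src":"https://aod.cos.tx.xmcdn.com/group77/M09/AE/4F/wKgO1V6DPwKCI8dMANcavfR4wq4296.m4a","albumIsSample":false,"sampleDuration":180,"isBaiduMusic":false,"firstPlayStatus":true}}'


def extract_src(payload):
    """Return the download URL from a parsed response, or None if absent."""
    # Assumption: anything other than ret == 200 means there is no usable data
    if payload.get("ret") != 200:
        return None
    data = payload.get("data") or {}
    return data.get("src")


print(extract_src(json.loads(SAMPLE)))
```

Returning None rather than raising lets the caller skip a track and carry on with the rest of the album.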

The link is in "src". The download code:

def get_Video(vdurl, nm):
    # vdurl and nm are parallel lists: audio URL and track name
    for url, title in zip(vdurl, nm):
        res = open_url(url).json()
        # The download link sits under "data" -> "src" in the response
        tempurl = res['data']['src']
        video = open_url(tempurl)
        filename = f"{title}.m4a"
        print("Downloading:", filename)
        with open(filename, 'wb') as f:
            f.write(video.content)
    print("Download complete!")

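One thing the download code does not guard against is a track title containing characters that are not allowed in file names, such as a slash or a question mark. A possible sketch of a sanitizing step is below; `safe_filename` is a hypothetical helper, and the character set is the one Windows rejects, chosen here as a conservative assumption.

```python
import re


def safe_filename(title, ext=".m4a"):
    """Replace characters that are invalid in Windows file names with '_'."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a fixed name if nothing usable is left
    return (cleaned or "untitled") + ext


print(safe_filename("Part 1/2: Opening?"))
```

The result could be used in place of the f-string that builds filename.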

Full code:

from requests import get
from lxml import etree
import os

# Save everything into a "Video" folder, creating it if it doesn't exist
os.makedirs("Video", exist_ok=True)
os.chdir("Video")

def open_url(url):
    # Send a browser-like User-Agent with every request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
    res = get(url, headers=headers)
    return res

def get_xs(res):
    temp = 'https://www.ximalaya.com/xiangsheng'
    html = etree.HTML(res.text)
    name = html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a/span/text()")
    # The scraped links are incomplete, so join each one onto temp
    href = [temp + each for each in html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a[1]/@href")]
    # Number the (name, link) pairs from 1 so the user can pick one
    i = 1
    result = {}
    for k in zip(name, href):
        result[i] = k
        i += 1
    return result

def get_Videourl(nm, hf, page):
    # Helper that flattens a list of lists into one stream of items
    def tempfunc(list1):
        for each in list1:
            yield from each
    vdurl = [] # list that will hold the audio URLs
    name = []
    for i in range(1, page + 1):
        tempurl = hf + f'p{i}'
        res = open_url(tempurl)
        html = etree.HTML(res.text)
        href = html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/@href')
        name.append(html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/span/text()'))
        # By observation the id is the last 9 characters of the URL, so slice it out
        ids = [each[-9:] for each in href]
        for id in ids:
            vdurl.append('https://www.ximalaya.com/revision/play/v1/audio?id=%s&ptype=1' % id)
    return vdurl, list(tempfunc(name))

def get_Video(vdurl, nm):
    # vdurl and nm are parallel lists: audio URL and track name
    for url, title in zip(vdurl, nm):
        res = open_url(url).json()
        # The download link sits under "data" -> "src" in the response
        tempurl = res['data']['src']
        video = open_url(tempurl)
        filename = f"{title}.m4a"
        print("Downloading:", filename)
        with open(filename, 'wb') as f:
            f.write(video.content)
    print("Download complete!")

def main():
    url = 'https://www.ximalaya.com/xiangsheng/xiangsheng/mr132t2722/'
    res = open_url(url)
    result = get_xs(res)
    # Print "number name" for every album on the first page
    for i, (name, _) in result.items():
        print(i, name)
    choice = int(input("Enter the number of the album you want to hear: "))
    nm, hf = result[choice]
    page = int(input("Enter how many pages to scrape: "))
    vdurl, nm = get_Videourl(nm, hf, page)
    get_Video(vdurl, nm)

if __name__ == "__main__":
    main()


乘号 · posted 2020-5-2 11:55:09

First reply!

cupbbboom · posted 2020-5-2 12:58:57

2892150342ABC · posted 2020-8-19 21:44:08

I don't get it.

jmy_286501 · posted 2022-10-15 16:20:50

I used selenium to scrape it, but the VIP content can't be scraped.
