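Last edited by qiuyouzhi on 2020-5-2 11:54
Scraping free crosstalk (xiangsheng) audio from Ximalaya with Python, V2
I meant to add this to my previous thread, but that one has sunk, so I'm starting a new one here.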
Requirements
1. Print every crosstalk album on the first page and let the user pick one.
2. Download all the audio tracks in that album.
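Approach
Target URL: https://www.ximalaya.com/xiangsheng/xiangsheng/mr132t2722/
OK, first scout the site.
Inspect the page elements and look around:
Hmm, no useful data turned up there, only an image that is no help.
What now? Capture the network traffic!
Oh, what is this? A request whose name starts with audio!
Open it:
See the src field? Go straight to the URL it holds:
It can be downloaded directly!
So, look at the URL of the request that contained the download URL: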
https://www.ximalaya.com/revision/play/v1/audio?id=276429286&ptype=1
The id parameter needs a track ID. We have no good way to get one yet, so skip it for now.
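To make the shape of that endpoint concrete, here is a minimal sketch that assembles the URL from a track ID. The helper name `build_audio_url` is hypothetical, and the sketch assumes `ptype=1` is always the right value, simply because that is what the captured request used.

```python
from urllib.parse import urlencode

# Endpoint taken from the captured request above.
AUDIO_API = "https://www.ximalaya.com/revision/play/v1/audio"


def build_audio_url(track_id, ptype=1):
    """Return the audio-info URL for one track ID (hypothetical helper)."""
    # urlencode keeps the parameters in insertion order: id first, then ptype.
    query = urlencode({"id": track_id, "ptype": ptype})
    return f"{AUDIO_API}?{query}"


print(build_audio_url(276429286))
```

Keeping the endpoint in one constant means only one line changes if the site ever moves it.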
Start with the first part, scraping the crosstalk albums on the first page:
See that long run of li tags? What we want is inside them.
Implementation:
    from requests import get
    from lxml import etree

    def open_url(url):
        # Send a browser User-Agent header with every request.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
        res = get(url, headers=headers)
        return res

    def get_xs(res):
        temp = 'https://www.ximalaya.com/xiangsheng'
        html = etree.HTML(res.text)
        name = html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a/span/text()")
        # The scraped links are incomplete, so prepend temp to each one.
        href = [temp + each for each in html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a[1]/@href")]
        # Number the albums from 1: {1: (name, href), 2: (name, href), ...}
        result = {}
        for i, k in enumerate(zip(name, href), start=1):
            result[i] = k
        return result
OK, the names and links are now stored in the dictionary.
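To show what that dictionary looks like, here is a small sketch of the numbering step on its own. The function name `number_albums` and the sample names and links are made up for illustration; they are not real Ximalaya data.

```python
def number_albums(names, hrefs):
    """Map 1-based menu numbers to (name, href) pairs."""
    # zip stops at the shorter list, so a missing name or link drops that entry.
    return dict(enumerate(zip(names, hrefs), start=1))


# Made-up sample data, only to show the resulting shape.
albums = number_albums(["Album A", "Album B"], ["/a/", "/b/"])
for number, (name, href) in albums.items():
    print(number, name, href)
```

Numbering from 1 keeps the menu friendly for the user, and the tuple can be unpacked directly once a number is chosen.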
Now scrape the crosstalk tracks inside an album:
Same as before, everything is in li tags, so the code can be written as:
    def get_Videourl(nm, hf, page):
        def tempfunc(list1):
            # Flatten a list of lists into one stream of items.
            for each in list1:
                yield from each
        vdurl = []  # list that holds the audio URLs
        name = []
        for i in range(1, page + 1):
            tempurl = hf + f'p{i}'  # page i of the album
            res = open_url(tempurl)
            html = etree.HTML(res.text)
            href = html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/@href')
            name.append(html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/span/text()'))
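The loop above builds each page address by appending `p1`, `p2`, and so on to the album link. As a rough sketch of that step in isolation, the hypothetical helper `page_urls` below also adds a trailing slash when the album link lacks one. Whether scraped album links always end in a slash is an assumption I have not checked, and the sample album link is only an example.

```python
def page_urls(album_url, pages):
    """Return the URLs of pages 1..pages of an album (hypothetical helper)."""
    # Make sure "p1" is appended after a slash, not glued onto the album ID.
    base = album_url if album_url.endswith("/") else album_url + "/"
    return [f"{base}p{i}" for i in range(1, pages + 1)]


for url in page_urls("https://www.ximalaya.com/xiangsheng/35105051", 2):
    print(url)
```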
OK, that is most of it. But look back at the URL of a crosstalk track from earlier:
https://www.ximalaya.com/xiangsheng/35105051/290954115
Hey! Aren't the last 9 digits the ID?
So just take the href that the XPath already gives us and slice the ID out:
    def get_Videourl(nm, hf, page):
        def tempfunc(list1):
            # Flatten a list of lists into one stream of items.
            for each in list1:
                yield from each
        vdurl = []  # list that holds the audio URLs
        name = []
        for i in range(1, page + 1):
            tempurl = hf + f'p{i}'  # page i of the album
            res = open_url(tempurl)
            html = etree.HTML(res.text)
            href = html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/@href')
            name.append(html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/span/text()'))
            # By observation the ID is the last 9 characters of the URL, so slice it out.
            # This runs once per page, before href is overwritten by the next page.
            ids = [each[-9:] for each in href]
            for track_id in ids:
                vdurl.append('https://www.ximalaya.com/revision/play/v1/audio?id=%s&ptype=1' % track_id)
        return vdurl, list(tempfunc(name))
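Slicing the last 9 characters works for the links seen here, but it would break on an ID of a different length. As a hedged alternative, the sketch below takes the last path segment instead. The name `track_id_from_href` is hypothetical, and it assumes the ID is always the final segment of the link, as in the example URL above.

```python
from urllib.parse import urlparse


def track_id_from_href(href):
    """Return the last path segment of a track link, assumed to be the ID."""
    # Works for both relative links and full URLs; ignores a trailing slash.
    segment = urlparse(href).path.rstrip("/").split("/")[-1]
    if not segment.isdigit():
        raise ValueError(f"no numeric track ID in {href!r}")
    return segment


print(track_id_from_href("https://www.ximalaya.com/xiangsheng/35105051/290954115"))
```

Raising on a non-numeric segment makes a layout change on the site show up as an error instead of a silently wrong URL.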
OK! All that's left is downloading the data.
With the ID in hand, downloading is the easy part. Requesting the audio URL returns:
    {"ret":200,"data":{"trackId":276429286,"canPlay":true,"isPaid":false,"hasBuy":true,"src":"https://aod.cos.tx.xmcdn.com/group77/M09/AE/4F/wKgO1V6DPwKCI8dMANcavfR4wq4296.m4a","albumIsSample":false,"sampleDuration":180,"isBaiduMusic":false,"firstPlayStatus":true}}
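To show how the download link can be read out of a response like this one, here is a minimal sketch that parses the sample with the standard `json` module. The helper `extract_src` is hypothetical. It assumes that `ret` equal to 200 means success and that `src` can be missing or empty for tracks that cannot be played; the post does not confirm either point.

```python
import json

# The sample response from the post, split across lines for readability.
SAMPLE = (
    '{"ret":200,"data":{"trackId":276429286,"canPlay":true,"isPaid":false,'
    '"hasBuy":true,"src":"https://aod.cos.tx.xmcdn.com/group77/M09/AE/4F/'
    'wKgO1V6DPwKCI8dMANcavfR4wq4296.m4a","albumIsSample":false,'
    '"sampleDuration":180,"isBaiduMusic":false,"firstPlayStatus":true}}'
)


def extract_src(payload):
    """Return the audio link from a parsed response (hypothetical helper)."""
    # Assumption: any ret other than 200 signals a failed request.
    if payload.get("ret") != 200:
        raise ValueError(f"unexpected ret: {payload.get('ret')!r}")
    src = (payload.get("data") or {}).get("src")
    if not src:
        raise ValueError("response has no src link")
    return src


print(extract_src(json.loads(SAMPLE)))
```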
The link is under "src". The download code:
    def get_Video(vdurl, nm):
        # vdurl and nm are parallel lists: audio-info URLs and track names.
        for i, url in enumerate(vdurl):
            res = open_url(url).json()
            tempurl = res['data']['src']  # the download link sits under data.src
            video = open_url(tempurl)
            filename = f"{nm[i]}.m4a"
            print("Downloading:", filename)
            with open(filename, 'wb') as f:
                f.write(video.content)
        print("Download complete!")
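One thing the download code leaves open is that a track title goes straight into the file name, so a title containing a character such as `/` or `?` could make `open` fail. Below is a small sketch of one way to guard against that. The helper `safe_filename` is hypothetical, and its character list covers the characters Windows forbids in file names, which is my choice and not something from the post.

```python
import re


def safe_filename(title, ext=".m4a"):
    """Replace characters that are not allowed in file names (hypothetical helper)."""
    # Swap each forbidden character for an underscore, then trim outer spaces.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "untitled") + ext


print(safe_filename("part 1/2: what?"))
```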
Full code:
    from requests import get
    from lxml import etree
    import os

    # Save everything into a "Video" folder, creating it if it does not exist.
    os.makedirs("Video", exist_ok=True)
    os.chdir("Video")

    def open_url(url):
        # Send a browser User-Agent header with every request.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36'}
        res = get(url, headers=headers)
        return res

    def get_xs(res):
        temp = 'https://www.ximalaya.com/xiangsheng'
        html = etree.HTML(res.text)
        name = html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a/span/text()")
        # The scraped links are incomplete, so prepend temp to each one.
        href = [temp + each for each in html.xpath("//*[@class='general-album-list']/div[@class='content']/ul/li/div/a[1]/@href")]
        # Number the albums from 1: {1: (name, href), 2: (name, href), ...}
        result = {}
        for i, k in enumerate(zip(name, href), start=1):
            result[i] = k
        return result

    def get_Videourl(nm, hf, page):
        def tempfunc(list1):
            # Flatten a list of lists into one stream of items.
            for each in list1:
                yield from each
        vdurl = []  # list that holds the audio URLs
        name = []
        for i in range(1, page + 1):
            tempurl = hf + f'p{i}'  # page i of the album
            res = open_url(tempurl)
            html = etree.HTML(res.text)
            href = html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/@href')
            name.append(html.xpath('//*[@class="sound-list _Qp"]/ul/li/div[2]/a/span/text()'))
            # By observation the ID is the last 9 characters of the URL, so slice it out.
            # This runs once per page, before href is overwritten by the next page.
            ids = [each[-9:] for each in href]
            for track_id in ids:
                vdurl.append('https://www.ximalaya.com/revision/play/v1/audio?id=%s&ptype=1' % track_id)
        return vdurl, list(tempfunc(name))

    def get_Video(vdurl, nm):
        # vdurl and nm are parallel lists: audio-info URLs and track names.
        for i, url in enumerate(vdurl):
            res = open_url(url).json()
            tempurl = res['data']['src']  # the download link sits under data.src
            video = open_url(tempurl)
            filename = f"{nm[i]}.m4a"
            print("Downloading:", filename)
            with open(filename, 'wb') as f:
                f.write(video.content)
        print("Download complete!")

    def main():
        url = 'https://www.ximalaya.com/xiangsheng/xiangsheng/mr132t2722/'
        res = open_url(url)
        result = get_xs(res)
        # Print each album as "number name".
        for i, (name, _) in result.items():
            print(i, name)
        choice = int(input("Enter the number of the album you want to listen to: "))
        nm, hf = result[choice]
        page = int(input("Enter how many pages to scrape (starting from page 1): "))
        vdurl, nm = get_Videourl(nm, hf, page)
        get_Video(vdurl, nm)

    if __name__ == "__main__":
        main()
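The menu step in main() passes the raw input straight to int(), so a typo or an album number that is not in the list ends the script with an exception. Here is a minimal sketch of a gentler check. The helper `parse_choice` is hypothetical and not part of the script above.

```python
def parse_choice(text, valid_numbers):
    """Return the chosen number, or None when the input is not a valid choice."""
    try:
        number = int(text.strip())
    except ValueError:
        # Not a number at all, e.g. "abc" or an empty line.
        return None
    return number if number in valid_numbers else None


print(parse_choice("2", {1, 2, 3}))
print(parse_choice("abc", {1, 2, 3}))
```

In main(), the call could sit in a loop that asks again until `parse_choice(input(...), result)` returns something other than None, since `in` on a dictionary checks its keys.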