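Take scraping the video data of the Bilibili uploader 老番茄 as an example. The code is as follows:

import requests
from bs4 import BeautifulSoup

def open_url(url):
    # Send a browser-like User-Agent with the request.
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    res = requests.get(url, headers=headers)
    return res

def main():
    url = "https://space.bilibili.com/546195/video"
    # The 'lxml' parser is selected by name; the lxml package must be installed.
    soup = BeautifulSoup(open_url(url).text, 'lxml')
    # Print the fetched source so it can be compared with the browser's view.
    print(soup.prettify())

if __name__ == "__main__":
    main()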
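The code above is only meant to check whether the scraped source matches the page. I can't post images, so it is hard to describe, but the scraped content does not match what Inspect Element shows for the original page: none of the video details that should be there are present...
Here is the output: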
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="initial-scale=1,maximum-scale=1,user-scalable=no" name="viewport"/>
<meta content="" name="keywords"/>
<meta content="" name="description"/>
<meta content="no" name="apple-mobile-web-app-capable"/>
<meta content="telephone=no" name="format-detection"/>
<title>
搜索 | 腾讯招聘
</title>
<link href="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/css/main.css" rel="stylesheet"/>
<link href="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/css/jquery-ui.min.css" rel="stylesheet"/>
</head>
<body>
<div id="app">
<script src="https://cdn.multilingualres.hr.tencent.com/careersmlr/HeadFoot_zh-cn.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/careersmlr/HostMsg_zh-cn.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/careersmlr/Search_zh-cn.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor/config.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor/jquery.min.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor/jquery.ellipsis.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor/report.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor/qrcode.min.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/manifest.build.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor.build.js" type="text/javascript">
</script>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/p_zh-cn_search.build.js" type="text/javascript">
</script>
</body>
<script src="https://cdn.multilingualres.hr.tencent.com/tencentcareer/static/js/vendor/common.js" type="text/javascript">
</script>
</html>
PS F:\All_about_study\VS_code\python> & C:/Users/asus/AppData/Local/Programs/Python/Python38-32/python.exe f:/All_about_study/VS_code/python/爬虫/Beautifulsoup_text.py
<!DOCTYPE html>
<html>
<head>
<meta content="333.999" name="spm_prefix"/>
<meta charset="utf-8"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="webkit|ie-comp|ie-stand" name="renderer"/>
<script type="text/javascript">
window.__BILI_CONFIG__={"show_bv":true}
</script>
<script type="text/javascript">
var ua=window.navigator.userAgent,agents=["Android","iPhone","SymbianOS","Windows Phone","iPod"],pathname=/\d+/.exec(window.location.pathname),getCookie=function(e){return decodeURIComponent(document.cookie.replace(new RegExp("(?:(?:^|.*;)\\s*"+encodeURIComponent(e).replace(/[\-\.\+\*]/g,"\\$&")+"\\s*\\=\\s*([^;]*).*$)|^.*$"),"$1"))||null},DedeUserID=getCookie("DedeUserID"),mid=pathname?+pathname[0]:null===DedeUserID?0:+DedeUserID;if(mid<1)window.location.href="https://passport.bilibili.com/login?gourl=https://space.bilibili.com";else{window._bili_space_mid=mid,window._bili_space_mymid=null===DedeUserID?0:+DedeUserID;for(var prefix=/^\/v/.test(pathname)?"/v":"",i=0;i<agents.length;i++)if(-1<ua.indexOf(agents[i])){window.location.href="https://m.bilibili.com/space/"+mid;break}}
</script>
<link as="script" href="//s1.hdslb.com/bfs/static/player/main/video.js?v=2020330" rel="prefetch"/>
<script src="//s1.hdslb.com/bfs/static/jinkela/long/js/sentry/sentry-5.2.1.min.js" type="text/javascript">
</script>
<script src="//s1.hdslb.com/bfs/static/jinkela/long/js/sentry/sentry.vue.js" type="text/javascript">
</script>
<link href="//s1.hdslb.com/bfs/static/jinkela/space/css/space.4.6fe27996f6ff0fcc831bb0798235df84ca1445fd.css" rel="stylesheet"/>
<link href="//s1.hdslb.com/bfs/static/jinkela/space/css/space.3.6fe27996f6ff0fcc831bb0798235df84ca1445fd.css" rel="stylesheet"/>
<title>
不动的ACG大图书馆的个人空间 - 哔哩哔哩 ( ゜- ゜)つロ 乾杯~ Bilibili
</title>
<meta content="不动的ACG大图书馆,B站,弹幕,字幕,AMV,MAD,MTV,ANIME,动漫,动漫音乐,游戏,游戏解说,ACG,galgame,动画,番组,新番,初音,洛天依,vocaloid" name="keywords"/>
<meta content="不动的ACG大图书馆,官方邮箱: acglibrary@bilibili.com 《动画透镜》等系列栏目持续更新中!我们热爱动画,和你一起。,bilibili是国内知名的视频弹幕网站,这里有最及时的动漫新番,最棒的ACG氛围,最有创意的Up主。大家可以在这里找到许多欢乐。" name="description"/>
</head>
<body>
<div id="biliMainHeader" style="height:56px">
</div>
<!--[if lt IE 9]><div id="browser-version-tip">
<div class="wrapper">
抱歉,您正在使用不支持的浏览器访问个人空间。推荐您<a href="//www.google.cn/chrome/browser/desktop/index.html">安装 Chrome 浏览器</a>以获得更好的体验
ヾ(o◕∀◕)ノ
</div>
</div><![endif]-->
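<div id="space-app"> (in the browser this line holds the details of every video, but in the scraped source it has been replaced by this empty div)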
</div>
<script type="text/javascript">
window.spaceReport={},window.reportConfig={sample:1,scrollTracker:!0,msgObjects:"spaceReport"};var reportScript=document.createElement("script");reportScript.src="//s1.hdslb.com/bfs/seed/log/report/log-reporter.js",document.getElementsByTagName("body")[0].appendChild(reportScript),reportScript.onerror=function(){console.warn("log-reporter.js加载失败,放弃上报");var r=function(){};window.reportObserver={sendPV:r,forceCommit:r}}
</script>
<script src="//s1.hdslb.com/bfs/static/jinkela/long/js/jquery/jquery1.7.2.min.js">
</script>
<script defer="defer" src="//s1.hdslb.com/bfs/seed/jinkela/header-v2/header.js" type="text/javascript">
</script>
<script src="//s1.hdslb.com/bfs/static/jinkela/space/4.space.6fe27996f6ff0fcc831bb0798235df84ca1445fd.js" type="text/javascript">
</script>
<script src="//s1.hdslb.com/bfs/static/jinkela/space/space.6fe27996f6ff0fcc831bb0798235df84ca1445fd.js" type="text/javascript">
</script>
</body>
</html>
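I don't fully understand the details myself, because I have never scraped videos.
- When you scroll through Bilibili's video feed, the page is clearly a waterfall-style (infinite-scroll) layout, so a single request probably does not return the information you want.
- As for the video playback page, I don't have Flash Player installed on my computer, so Bilibili is presumably using HTML5.
- If it were me, I would send the request with selenium and then parse the result with lxml.
- As for the anti-crawler explanation given in the reply above, I don't think that is the cause. Common measures such as bot verification or checks on headers, IP, and cookies would only stop you from getting the page at all.
- The HTML encryption and mapping techniques I know of do not appear in the HTML the original poster fetched.
- I think Bilibili's page is rendered dynamically by JavaScript. In that situation I use selenium, because I don't know any other way.
- There are many videos on Bilibili about scraping JS-rendered or JS-encrypted pages; you could take a look.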
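The points above can be sketched roughly as follows. This is only a minimal sketch, not a tested Bilibili scraper. It assumes Selenium 4 with a local Chrome install. The helper names `MountPointProbe`, `count_children`, and `fetch_rendered_html` are made up for illustration. The check uses the standard library's `html.parser` instead of lxml purely to keep it dependency-free. The first part counts the tags inside `<div id="space-app">`, which is 0 for the source fetched with requests. The second part loads the page in a real browser so the JavaScript can fill that div in before the source is read.

```python
from html.parser import HTMLParser


class MountPointProbe(HTMLParser):
    """Count the tags nested inside the element with a given id."""

    # Void elements never get a closing tag, so they must not change the depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0       # 0 means we are outside the target element
        self.child_tags = 0  # tags seen inside the target element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.child_tags += 1
            if tag not in self.VOID_TAGS:
                self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in self.VOID_TAGS:
            self.depth -= 1


def count_children(html, element_id):
    """Return how many tags sit inside the element whose id is element_id."""
    probe = MountPointProbe(element_id)
    probe.feed(html)
    return probe.child_tags


def fetch_rendered_html(url, wait_css, timeout=15):
    """Open url in Chrome and return the DOM after JavaScript has run.

    wait_css is a CSS selector that should only match once the content
    has been rendered; which selector is right for a given page is an
    assumption the caller has to verify in the browser's inspector.
    """
    # Imported lazily so count_children works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css)))
        return driver.page_source
    finally:
        driver.quit()
```

For example, `count_children(open_url(url).text, "space-app")` on the requests-based code from the original post should give 0 if the fetched source looks like the output pasted above. Calling `fetch_rendered_html(url, "#space-app *")` waits until that div has at least one descendant, and the returned string can then be passed to `BeautifulSoup(..., 'lxml')` as before. The video list itself may appear later than the first descendant, so a more specific selector taken from the inspector is probably needed in practice.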