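I have been stuck on this problem for two days and have read a lot of CSDN posts. The garbled-text issue can be solved with r.encoding = r.apparent_encoding, but I don't know how to handle the big block of binary data in the middle of the output. I hope someone can take a look, ideally with a fix that fits my code below.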
import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        # The request itself must be inside the try block,
        # otherwise RequestException can never be caught here.
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def main():
    url = 'https://maoyan.com/board'
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()
What the scraper returned:
C:\Users\lijin\AppData\Local\Programs\Python\Python39-32\python.exe E:/QMDownload/pythonProject2/main.py
<!DOCTYPE html>
<!--[if IE 8]><html class="ie8"><![endif]-->
<!--[if IE 9]><html class="ie9"><![endif]-->
<!--[if gt IE 9]><!--><html><!--<![endif]-->
<head>
<title>电影院票房购票_评分_选座_经典影视推荐-猫眼电影</title>
<link rel="dns-prefetch" href="//p0.meituan.net" />
<link rel="dns-prefetch" href="//p1.meituan.net" />
<link rel="dns-prefetch" href="//ms0.meituan.net" />
<link rel="dns-prefetch" href="//s0.meituan.net" />
<link rel="dns-prefetch" href="//ms1.meituan.net" />
<link rel="dns-prefetch" href="//analytics.meituan.com" />
<link rel="dns-prefetch" href="//report.meituan.com" />
<link rel="dns-prefetch" href="//frep.meituan.com" />
<meta charset="utf-8">
<meta name="keywords" content="">
<meta name="description" content="">
<meta http-equiv="cleartype" content="yes" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="renderer" content="webkit" />
<meta name="HandheldFriendly" content="true" />
<meta name="format-detection" content="email=no" />
<meta name="format-detection" content="telephone=no" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<style>
/*! normalize.css v4.1.1 | MIT License | github.com/necolas/normalize.css */html{font-family:sans-serif;-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%}body{margin:0}article,aside,details,figcaption,figure,footer,header,main,menu,nav,section,summary{display:block}audio,canvas,progress,video{display:inline-block}audio:not([controls])
[Everything in the middle is binary data; I have omitted it because of the post length limit]
/cThLVFtGTtRiR7kC4HWn6uHI92SBxb1S8uy2ruKqba8q1nK6pF7pGv1H08jVCFamS5WDl2m1zj82CoB5DqBYA0O2jx8VsGi3+f0h8Jt9z94Aw2JaLdl8uPq8XqudrEQKEs1XdvjjeRed7ESO85xgG/xvK/38Sf7mV4VRn80Lfjsr36trSQmHQDhRagtLzFQXAanT1KuG9bEXQ0qVLezWv5RyBUJ1qBLPHR8RMU8rhS4Zgk4nopu2rgIDgeMIxR29nKDGZWsEcdVngJMecABVWpSwI2AcbgKgpn4uUyWBKBaVk2c3Zdu8S1cYp4vqnKD8Rk5V7bIaKz98S7e7UJcizwt16TLiEbiubXcIVOl0gjxIil3faholBrBbu5zQFHI6QIDSwLxGSjbkg3WL2btDiAIw33Kt89gvhZkZxUEqH8AIvFfo+zIuhTS8GmKDELmZrana15rHNcLY/auMexfk4SgDkXCFZ5MucL3QM6ZH/F2AAKF2kmR4aS34AAAAASUVORK5CYII=)}.app-download a .iphone-icon{background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAB4AAAAwCAYAAAAGlsrkAAAAo0lEQVR42u3YUQqAIAwG4B2kg+qbp6kT2I3qzTPUjAgpw6DmKv7BwCc/cfPBEXE45xprbcc5GmMmiYx7RyNalKBBCswcICw4L9paaIK3xIuhNhxNypzGb3V4INZS+r1zgJ9EU7wIk1AABgwYMGDAgAEDBgwY8J9gtU+bxDeV9+2LcK0E/D641HRnTXQbvtLpuWfzXVjiqnUGbGojRbUhqtbYeAb4Q5+BO1t9gAAAAABJRU5ErkJggg==)}.app-download a .download-icon{background-image:url(app-link-icon_2x.2470ea5731c3089bbffd3dac09fb7842.png)}
</style>
<style>
@font-face {
font-family: stonefont;
src: url('//vfile.meituan.net/colorstone/2d0fe6b3dc9b9a9f5b3487a81f50fea83420.eot');
src: url('//vfile.meituan.net/colorstone/2d0fe6b3dc9b9a9f5b3487a81f50fea83420.eot?#iefix') format('embedded-opentype'),
url('//vfile.meituan.net/colorstone/1ecfd52279ef3246e7880e020a19483d2276.woff') format('woff');
}
.stonefont {
font-family: stonefont;
}
</style>
</head>
<body>
<div class="header">
<div class="header-inner">
<a href="//maoyan.com" class="logo" data-act="icon-click"></a>
<div class="nav">
<ul class="navbar">
<li><a href="/" data-act="home-click" >首页</a></li>
<li><a href="/films" data-act="movies-click" >电影</a></li>
<li><a href="/cinemas" data-act="cinemas-click" >影院</a></li>
<li><a href="http://www.gewara.com">演出</a></li>
<li><a href="/board" data-act="board-click" >榜单</a></li>
<li><a href="/news" data-act="hotNews-click" >热点</a></li>
<li><a href="/edimall" >商城</a></li>
</ul>
</div>
<form action="/query" target="_blank" class="search-form" data-actform="search-click">
<input name="kw" class="search" type="search" maxlength="32" placeholder="找影视剧、影人、影院" autocomplete="off">
<input class="submit" type="submit" value="">
</form>
</div>
</div>
<div class="container" id="app" class="page-404/main" >
<div class="not-found-body">
<!-- <div class="img-wrap">
<img class="not-found-img" src="https://p0.meituan.net/scarlett/1219581ebe96d08cde224cd89d306a3e51912.png"/>
</div> -->
<p class="not-found-message">403 很抱歉,您的访问请求由于过于频繁而被禁止。</p>
<p class="error-message">sorry,your request was rejected.</p>
<p class="error-message">如有疑问,请将此页截图并发送邮件至 <a href="mailto:mywt@maoyan.com">mywt@maoyan.com</a></p>
<p class="error-message">--------------------------------------------- Request Info ---------------------------------------------</p>
<span class="error-msg line">访问时间:<p id="servertime"></p></span>
<span class="error-msg line">IP:<p id="ip"></p></span>
<span class="error-msg line">Referer:<p id="referer"></p></span>
<span class="error-msg line">User-Agent:<p id="ua"></p></span>
<div class="home-button"><a href="/">返回首页<a></div>
</div>
</div>
<div class="footer">
<p class="friendly-links">
关于猫眼 :
<a href="http://ir.maoyan.com/s/index.php#pageScroll0" target="_blank">关于我们</a>
<span></span>
<a href="http://ir.maoyan.com/s/index.php#pageScroll1" target="_blank">管理团队</a>
<span></span>
<a href="http://ir.maoyan.com/s/index.php#pageScroll2" target="_blank">投资者关系</a>
友情链接 :
<a href="http://www.meituan.com" data-query="utm_source=wwwmaoyan" target="_blank">美团网</a>
<span></span>
<a href="http://www.gewara.com" data-query="utm_source=wwwmaoyan">格瓦拉</a>
<span></span>
<a href="http://i.meituan.com/client" data-query="utm_source=wwwmaoyan" target="_blank">美团下载</a>
<span></span>
<a href="https://www.huanxi.com" data-query="utm_source=maoyan_pc" target="_blank">欢喜首映</a>
</p>
<p class="friendly-links">
商务合作邮箱:v@maoyan.com
客服电话:10105335
违法和不良信息举报电话:4006018900
</p>
<p class="friendly-links">
用户投诉邮箱:tousujubao@meituan.com
舞弊线索举报邮箱:wubijubao@maoyan.com
</p>
<p class="friendly-links credentials">
<a href="/about/licence/1" target="_blank">中华人民共和国增值电信业务经营许可证 京B2-20190350</a>
<span></span>
<a href="/about/licence/4" target="_blank">营业性演出许可证 京演(机构)(2019)4094号</a>
</p>
<p class="friendly-links credentials">
<a href="/about/licence/3" target="_blank">广播电视节目制作经营许可证 (京)字第08478号</a>
<span></span>
<a href="/about/licence/2" target="_blank">网络文化经营许可证 京网文(2019)3837-369号 </a>
</p>
<p class="friendly-links credentials">
<a href="/rules/agreement" target="_blank">猫眼用户服务协议 </a>
<span></span>
<a href="/rules/rule" target="_blank">猫眼平台交易规则总则 </a>
<span></span>
<a href="/rules/privacy" target="_blank">隐私政策 </a>
</p>
<p class="friendly-links credentials">
<a href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010102003232" target="_blank">京公网安备
11010102003232号</a>
<span></span>
<a href="http://www.beian.miit.gov.cn/" target="_blank">京ICP备16022489号</a>
</p>
<p>北京猫眼文化传媒有限公司</p>
<p>
©<span class="my-footer-year">2016</span>
猫眼电影 maoyan.com</p>
<div class="certificate">
<a href="http://sq.ccm.gov.cn:80/ccnt/sczr/service/business/emark/toDetail/350CF8BCA8416C4FE0530140A8C0957E"
target="_blank">
<img src="http://p0.meituan.net/moviemachine/e54374ccf134d1f7b2c5b075a74fca525326.png" />
</a>
<a href="/about/licence/5" target="_blank">
<img src="http://p1.meituan.net/moviemachine/805f605d5cf1b1a02a4e3a5e29df003b8376.png" />
</a>
</div>
</div>
</body>
</html>
<script src="https://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script src="https://pv.sohu.com/cityjson?ie=utf-8"></script>
<script type="text/javascript">
const now = new Date($.ajax({async: false}).getResponseHeader("Date"));
const time = now.toLocaleString();
$('#servertime').text(time);
$('#ip').text(returnCitySN["cip"]);
$('#referer').text(document.referrer);
$('#ua').text(navigator.userAgent);
</script>
Process finished with exit code 0
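A note on the "binary" in the output above: judging from the url(data:image/png;base64,...) fragments in the first style block, it looks like base64 text for images inlined into the CSS rather than raw binary, so it is not an encoding problem. If it only gets in the way when printing, one option is to strip it before printing. The following is a minimal sketch using only the standard re module; the function name strip_data_uris and the placeholder text are made up for illustration.

```python
import re

# Matches CSS values of the form url(data:<mime>;base64,<payload>).
# The payload is limited to base64 characters, so the match stops at ")".
DATA_URI = re.compile(
    r"url\(\s*['\"]?data:[^;,)]+;base64,[A-Za-z0-9+/=\s]+['\"]?\s*\)"
)


def strip_data_uris(html, placeholder="url(#inline-image-removed)"):
    """Replace inline base64 images with a short placeholder (hypothetical helper)."""
    return DATA_URI.sub(placeholder, html)


# Small stand-in for the real page: one inline image, one ordinary image URL.
sample = (
    ".a{background-image:url(data:image/png;base64,iVBORw0KGgo=)}"
    ".b{background-image:url(app-link-icon.png)}"
)
cleaned = strip_data_uris(sample)
print(cleaned)
```

In the code from the question this would be called as print(strip_data_uris(html)). Ordinary url(...) references are left untouched.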
import requests
from lxml import etree


def main():
    url = 'https://maoyan.com/board'
    # Send a User-Agent header so the request does not use the default one.
    headers = {'user-agent': 'firefox'}
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    # Each movie on the board is a <dd> element.
    dds = html.xpath('//dd')
    for dd in dds:
        m_i = dd.xpath('./i/text()')[0]
        m_name = dd.xpath('./a/@title')[0]
        # normalize-space() already returns a string, so it must not be indexed with [0].
        m_star = dd.xpath('normalize-space(./div[1]/div[1]/div[1]/p[2]/text())')
        m_releasetime = dd.xpath('./div[1]/div[1]/div[1]/p[3]/text()')[0]
        print(m_i, m_name, m_star, m_releasetime)


if __name__ == '__main__':
    main()
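The page printed in the question is not the board itself. Its body carries a not-found-message paragraph saying, roughly, "403 Sorry, your request was blocked because requests were too frequent". Below is a rough sketch of how get_one_page from the question could detect that case and set a User-Agent header, as the code above does. The marker string is copied from the pasted output. The User-Agent value and the 10-second timeout are assumptions, and there is no guarantee that the site accepts this header.

```python
# Marker copied from the block page in the question's output.
BLOCK_MARKER = 'class="not-found-message"'

# Assumed browser-like User-Agent; replace it with your own browser's string.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0 Safari/537.36"
}


def is_block_page(html):
    """Return True if the HTML looks like the 'request rejected' page."""
    return BLOCK_MARKER in html


def get_one_page(url):
    # Imported here so is_block_page can be used without requests installed.
    import requests

    try:
        response = requests.get(url, headers=HEADERS, timeout=10)
    except requests.exceptions.RequestException:
        return None
    if response.status_code != 200:
        return None
    # Same encoding fix as mentioned in the question.
    response.encoding = response.apparent_encoding
    html = response.text
    # Treat the block page as a failure instead of returning it as content.
    return None if is_block_page(html) else html
```

With this version, main() from the question can stay as it is; html is None whenever the request fails or the block page comes back, so it is worth checking for None before parsing.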