寄安 posted on 2021-9-30 21:02:13

Part of my crawler's output is garbled, and the middle part looks compressed; it differs from the page scraped in the video...

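I have been stuck on this for two days and have read a lot of CSDN posts. The garbled text can be fixed with r.encoding = r.apparent_encoding, but I don't know how to deal with the big chunk of binary in the middle. I hope someone can take a look, ideally with a solution that fits my code below.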

import requests
from requests.exceptions import RequestException


def get_one_page(url):
    try:
        # The request must sit inside try, otherwise RequestException is never caught
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def main():
    url = 'https://maoyan.com/board'
    html = get_one_page(url)
    print(html)


if __name__ == '__main__':
    main()

Scraped output:
C:\Users\lijin\AppData\Local\Programs\Python\Python39-32\python.exe E:/QMDownload/pythonProject2/main.py
<!DOCTYPE html>

<!--><html class="ie8"><!-->
<!--><html class="ie9"><!-->
<!--><!--><html><!--<!-->
<head>
<title>电影院票房购票_评分_选座_经典影视推荐-猫眼电影</title>

<link rel="dns-prefetch" href="//p0.meituan.net"/>
<link rel="dns-prefetch" href="//p1.meituan.net"/>
<link rel="dns-prefetch" href="//ms0.meituan.net" />
<link rel="dns-prefetch" href="//s0.meituan.net" />
<link rel="dns-prefetch" href="//ms1.meituan.net" />
<link rel="dns-prefetch" href="//analytics.meituan.com" />
<link rel="dns-prefetch" href="//report.meituan.com" />
<link rel="dns-prefetch" href="//frep.meituan.com" />


<meta charset="utf-8">
<meta name="keywords" content="">
<meta name="description" content="">
<meta http-equiv="cleartype" content="yes" />
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta name="renderer" content="webkit" />

<meta name="HandheldFriendly" content="true" />
<meta name="format-detection" content="email=no" />
<meta name="format-detection" content="telephone=no" />
<meta name="viewport" content="width=device-width, initial-scale=1">

<style>
    /*! normalize.css v4.1.1 | MIT License | github.com/necolas/normalize.css */html{font-family:sans-serif;-ms-text-size-adjust:100%;-webkit-text-size-adjust:100%}body{margin:0}article,aside,details,figcaption,figure,footer,header,main,menu,nav,section,summary{display:block}audio,canvas,progress,video{display:inline-block}audio:not()      中间的都是二进制码,因为就篇幅长度限制,我就省略了   /cThLVFtGTtRiR7kC4HWn6uHI92SBxb1S8uy2ruKqba8q1nK6pF7pGv1H08jVCFamS5WDl2m1zj82CoB5DqBYA0O2jx8VsGi3+f0h8Jt9z94Aw2JaLdl8uPq8XqudrEQKEs1XdvjjeRed7ESO85xgG/xvK/38Sf7mV4VRn80Lfjsr36trSQmHQDhRagtLzFQXAanT1KuG9bEXQ0qVLezWv5RyBUJ1qBLPHR8RMU8rhS4Zgk4nopu2rgIDgeMIxR29nKDGZWsEcdVngJMecABVWpSwI2AcbgKgpn4uUyWBKBaVk2c3Zdu8S1cYp4vqnKD8Rk5V7bIaKz98S7e7UJcizwt16TLiEbiubXcIVOl0gjxIil3faholBrBbu5zQFHI6QIDSwLxGSjbkg3WL2btDiAIw33Kt89gvhZkZxUEqH8AIvFfo+zIuhTS8GmKDELmZrana15rHNcLY/auMexfk4SgDkXCFZ5MucL3QM6ZH/F2AAKF2kmR4aS34AAAAASUVORK5CYII=)}.app-download a .iphone-icon{background-image:url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAB4AAAAwCAYAAAAGlsrkAAAAo0lEQVR42u3YUQqAIAwG4B2kg+qbp6kT2I3qzTPUjAgpw6DmKv7BwCc/cfPBEXE45xprbcc5GmMmiYx7RyNalKBBCswcICw4L9paaIK3xIuhNhxNypzGb3V4INZS+r1zgJ9EU7wIk1AABgwYMGDAgAEDBgwY8J9gtU+bxDeV9+2LcK0E/D641HRnTXQbvtLpuWfzXVjiqnUGbGojRbUhqtbYeAb4Q5+BO1t9gAAAAABJRU5ErkJggg==)}.app-download a .download-icon{background-image:url(app-link-icon_2x.2470ea5731c3089bbffd3dac09fb7842.png)}
</style>

<style>
    @font-face {
      font-family: stonefont;
      src: url('//vfile.meituan.net/colorstone/2d0fe6b3dc9b9a9f5b3487a81f50fea83420.eot');
      src: url('//vfile.meituan.net/colorstone/2d0fe6b3dc9b9a9f5b3487a81f50fea83420.eot?#iefix') format('embedded-opentype'),
         url('//vfile.meituan.net/colorstone/1ecfd52279ef3246e7880e020a19483d2276.woff') format('woff');
    }

    .stonefont {
      font-family: stonefont;
    }
</style>
</head>
<body>


<div class="header">
<div class="header-inner">
          <a href="//maoyan.com" class="logo" data-act="icon-click"></a>


      <div class="nav">
            <ul class="navbar">
                <li><a href="/" data-act="home-click">首页</a></li>
                <li><a href="/films" data-act="movies-click" >电影</a></li>
                <li><a href="/cinemas" data-act="cinemas-click" >影院</a></li>
                <li><a href="http://www.gewara.com">演出</a></li>
               
                <li><a href="/board" data-act="board-click" >榜单</a></li>
                <li><a href="/news" data-act="hotNews-click" >热点</a></li>
                <li><a href="/edimall">商城</a></li>
            </ul>
      </div>


      <form action="/query" target="_blank" class="search-form" data-actform="search-click">
            <input name="kw" class="search" type="search" maxlength="32" placeholder="找影视剧、影人、影院" autocomplete="off">
            <input class="submit" type="submit" value="">
      </form>

   
</div>
</div>


    <div class="container" id="app" class="page-404/main" >
<div class="not-found-body">
    <!-- <div class="img-wrap">
      <img class="not-found-img" src="https://p0.meituan.net/scarlett/1219581ebe96d08cde224cd89d306a3e51912.png"/>
    </div> -->
    <p class="not-found-message">403 很抱歉,您的访问请求由于过于频繁而被禁止。</p>
    <p class="error-message">sorry,your request was rejected.</p>
    <p class="error-message">如有疑问,请将此页截图并发送邮件至 <a href="mailto:mywt@maoyan.com">mywt@maoyan.com</a></p>
    <p class="error-message">--------------------------------------------- Request Info ---------------------------------------------</p>
    <span class="error-msg line">访问时间:<p id="servertime"></p></span>
    <span class="error-msg line">IP:<p id="ip"></p></span>
    <span class="error-msg line">Referer:<p id="referer"></p></span>
    <span class="error-msg line">User-Agent:<p id="ua"></p></span>
    <div class="home-button"><a href="/">返回首页<a></div>   
</div>
    </div>

<div class="footer">
<p class="friendly-links">
    关于猫眼 :
    <a href="http://ir.maoyan.com/s/index.php#pageScroll0" target="_blank">关于我们</a>
    <span></span>
    <a href="http://ir.maoyan.com/s/index.php#pageScroll1" target="_blank">管理团队</a>
    <span></span>
    <a href="http://ir.maoyan.com/s/index.php#pageScroll2" target="_blank">投资者关系</a>
    &nbsp;&nbsp;&nbsp;&nbsp;
    友情链接 :
    <a href="http://www.meituan.com" data-query="utm_source=wwwmaoyan" target="_blank">美团网</a>
    <span></span>
    <a href="http://www.gewara.com" data-query="utm_source=wwwmaoyan">格瓦拉</a>
    <span></span>
    <a href="http://i.meituan.com/client" data-query="utm_source=wwwmaoyan" target="_blank">美团下载</a>
    <span></span>
    <a href="https://www.huanxi.com" data-query="utm_source=maoyan_pc" target="_blank">欢喜首映</a>
</p>
<p class="friendly-links">
    商务合作邮箱:v@maoyan.com
    客服电话:10105335
    违法和不良信息举报电话:4006018900
</p>
<p class="friendly-links">
    用户投诉邮箱:tousujubao@meituan.com
    舞弊线索举报邮箱:wubijubao@maoyan.com
</p>
<p class="friendly-linkscredentials">
    <a href="/about/licence/1" target="_blank">中华人民共和国增值电信业务经营许可证 京B2-20190350</a>
    <span></span>
    <a href="/about/licence/4" target="_blank">营业性演出许可证 京演(机构)(2019)4094号</a>
</p>
<p class="friendly-linkscredentials">
    <a href="/about/licence/3" target="_blank">广播电视节目制作经营许可证 (京)字第08478号</a>
    <span></span>
    <a href="/about/licence/2" target="_blank">网络文化经营许可证 京网文(2019)3837-369号 </a>
</p>
<p class="friendly-linkscredentials">
    <a href="/rules/agreement" target="_blank">猫眼用户服务协议 </a>
    <span></span>
    <a href="/rules/rule" target="_blank">猫眼平台交易规则总则 </a>
    <span></span>
    <a href="/rules/privacy" target="_blank">隐私政策 </a>
</p>
<p class="friendly-linkscredentials">
    <a href="http://www.beian.gov.cn/portal/registerSystemInfo?recordcode=11010102003232" target="_blank">京公网安备
      11010102003232号</a>
    <span></span>
    <a href="http://www.beian.miit.gov.cn/" target="_blank">京ICP备16022489号</a>
</p>
<p>北京猫眼文化传媒有限公司</p>
<p>
    &copy;<span class="my-footer-year">2016</span>
    猫眼电影 maoyan.com</p>
<div class="certificate">
    <a href="http://sq.ccm.gov.cn:80/ccnt/sczr/service/business/emark/toDetail/350CF8BCA8416C4FE0530140A8C0957E"
      target="_blank">
      <img src="http://p0.meituan.net/moviemachine/e54374ccf134d1f7b2c5b075a74fca525326.png" />
    </a>
    <a href="/about/licence/5" target="_blank">
      <img src="http://p1.meituan.net/moviemachine/805f605d5cf1b1a02a4e3a5e29df003b8376.png" />
    </a>
</div>
</div>

</body>
</html>
<script src="https://libs.baidu.com/jquery/2.1.4/jquery.min.js"></script>
<script src="https://pv.sohu.com/cityjson?ie=utf-8"></script>
<script type="text/javascript">
const now = new Date($.ajax({async: false}).getResponseHeader("Date"));
const time = now.toLocaleString();
$('#servertime').text(time);

$('#ip').text(returnCitySN["cip"]);

$('#referer').text(document.referrer);

$('#ua').text(navigator.userAgent);
</script>

Process finished with exit code 0

寄安 posted on 2021-9-30 21:04:59

Correction: the part in the middle is not binary code.

suchocolate posted on 2021-9-30 21:06:31

So what are you trying to scrape?

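寄安 posted on 2021-9-30 21:07:41

suchocolate posted on 2021-9-30 21:06
So what are you trying to scrape?

The information about the movies on the Maoyan Movies board (chart) page. This is only the initial code; it isn't complete yet.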

寄安 posted on 2021-9-30 21:08:31

suchocolate posted on 2021-9-30 21:06
So what are you trying to scrape?

It went wrong right at the start, so I haven't gotten to the later steps yet.

suchocolate posted on 2021-9-30 21:20:26

寄安 posted on 2021-9-30 21:08
It went wrong right at the start, so I haven't gotten to the later steps yet.


import requests
from lxml import etree


def main():
    url = 'https://maoyan.com/board'
    # Send a User-Agent header; the request above without one got the 403 page
    headers = {'user-agent': 'firefox'}
    r = requests.get(url, headers=headers)
    html = etree.HTML(r.text)
    # Each <dd> element is expected to hold one movie entry
    dds = html.xpath('//dd')
    for dd in dds:
        m_i = dd.xpath('./i/text()')        # rank
        m_name = dd.xpath('./a/@title')     # title
        m_star = dd.xpath('normalize-space(./div/div/div/p/text())')
        m_releasetime = dd.xpath('./div/div/div/p/text()')
        print(m_i, m_name, m_star, m_releasetime)


if __name__ == '__main__':
    main()
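
For reference, the same idea can be folded into the original get_one_page. This is only a minimal sketch, not a tested scraper for this site: the User-Agent string, the 10-second timeout, and the helper is_block_page with its BLOCK_MARKER are assumptions introduced here for illustration. The marker is the not-found-message class that appears in the rejected-request page posted above.

```python
import requests
from requests.exceptions import RequestException

# Assumption: a browser-like User-Agent; this exact string is only an example.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:92.0) '
                   'Gecko/20100101 Firefox/92.0')
}

# Assumption: this class name, taken from the posted 403 page, marks a rejected request.
BLOCK_MARKER = 'not-found-message'


def is_block_page(html):
    """Return True if the HTML looks like the 'request rejected' page."""
    return BLOCK_MARKER in html


def get_one_page(url):
    """Fetch url and return its HTML, or None on failure or a block page."""
    try:
        # The request sits inside try so that connection errors are caught.
        response = requests.get(url, headers=HEADERS, timeout=10)
    except RequestException:
        return None
    if response.status_code != 200:
        return None
    # Let requests guess the encoding from the body to avoid garbled text.
    response.encoding = response.apparent_encoding
    html = response.text
    return None if is_block_page(html) else html
```

Calling get_one_page('https://maoyan.com/board') from main() then works as in the original code, except that a rejected request yields None rather than the 403 page.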

寄安 posted on 2021-9-30 21:22:56

suchocolate posted on 2021-9-30 21:20


That works, thanks a lot for the help!

寄安 posted on 2021-9-30 21:24:31

suchocolate posted on 2021-9-30 21:20


You took me all the way there in one go.

寄安 posted on 2021-9-30 21:26:16

suchocolate posted on 2021-9-30 21:20


I'd still like to ask: how should I handle the stuff that showed up in the middle?

寄安 posted on 2021-9-30 21:31:33

suchocolate posted on 2021-9-30 21:20


This is from a video by 崔庆才 that I was following.

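suchocolate posted on 2021-9-30 23:53:23

寄安 posted on 2021-9-30 21:26
I'd still like to ask: how should I handle the stuff that showed up in the middle?

Judging from the HTML that my code gets with a plain GET, there isn't nearly that much content under style.
Also, style is CSS and has nothing to do with the information you want, so you can simply ignore it.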
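
To make that concrete: the long run of characters inside the style block in the posted output follows url(data:image/png;base64,...), so it is a base64-encoded PNG embedded in the CSS rather than raw binary or compressed HTML. A rough sketch of checking this and dropping the style blocks before printing is shown below. The helper names and the sample string are made up for illustration; the base64 text is only a shortened prefix of one payload from the output above, and a regex is a shortcut here, not a full HTML parser.

```python
import base64
import re

# Matches a whole <style>...</style> element, including multi-line bodies.
STYLE_RE = re.compile(r'<style\b[^>]*>.*?</style>', re.DOTALL | re.IGNORECASE)
# Captures the payload of an inline PNG: url(data:image/png;base64,<payload>)
DATA_URI_RE = re.compile(r'data:image/png;base64,([A-Za-z0-9+/=]+)')

PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def strip_style_blocks(html):
    """Remove every <style> element so the printed HTML is easier to read."""
    return STYLE_RE.sub('', html)


def is_png_payload(b64_text):
    """Decode the first 12 base64 characters (9 bytes) and look for the PNG signature."""
    if len(b64_text) < 12:
        return False
    head = base64.b64decode(b64_text[:12])
    return head.startswith(PNG_SIGNATURE)


# Hypothetical sample built from a shortened prefix of the posted payload.
sample = ('<style>.iphone-icon{background-image:'
          'url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAB4AAAAw)}</style>'
          '<p>content</p>')

payloads = DATA_URI_RE.findall(sample)
print(all(is_png_payload(p) for p in payloads))  # True
print(strip_style_blocks(sample))                # <p>content</p>
```

If the page is parsed with lxml as in the code above, none of this is needed for the extraction itself, since the XPath queries never touch the style elements.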

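寄安 posted on 2021-10-1 11:28:14

suchocolate posted on 2021-9-30 23:53
Judging from the HTML that my code gets with a plain GET, there isn't nearly that much content under style.
Also, style is CSS and has nothing to do with the info ...

OK, got it, thanks for the reply.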