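是阿佳! posted on 2022-1-15 22:42:41

Python: my IP got banned while scraping Douban. What can I do?

This post was last edited by 是阿佳! on 2022-1-18 14:08

I'm already using proxies and setting headers, but I still run into this problem: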

import pickle, random, requests, bs4

def loadips():

    with open('ips2.pkl', 'rb') as f:
      ips = pickle.load(f)

    return ips



def getSoup(ips):
   
    headers = {"User-Agent":"ozilla/5.0 (Windows NT 10.0; " \
               + "Win64; x64) AppleWebKit/537.36 (KHTML, like " \
               + "Gecko) Chrome/97.0.4692.71 Safari/537.36"}
   
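    # The nums list was lost from the post; these offsets are inferred
    # from the start= values in the printed URLs.
    hosts, nums, soups = [], list(range(0, 250, 25)), []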


    for i in nums:
      hosts.append("https://movie.douban.com/top250?start=" + str(i) +"&filter=")


    for i in hosts:
      
      proxy = {'http': random.choice(ips)}
      res = requests.get(i, \
                     headers=headers, proxies=proxy)

      html = bs4.BeautifulSoup(res.text, "html.parser")
      soups.append(html)
      print(proxy)
      print(html)
      print('='*100)

    return soups



getSoup(loadips())



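Printed output (each response is Douban's login-redirect page, which says that abnormal requests were sent from your IP and asks you to log in):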


{'http': '219.246.65.55:80'}
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>豆瓣 - 登录跳转页</title>
<style type="text/css">
      body{font-family:Arial,Helvetica,sans-serif;font-size:14px;}
      h1{font-size:25px;margin:25px 0 10px 0;}
    </style>
</head>
<body>
<div>
<div style="margin:20px auto;">
<div style="font-size:25px;color:#1b9336;border-bottom:5px solid #eef9eb">
<span style="font-size:20px;font-weight:bold">豆瓣</span> d<span style="color:#0092c8">o</span><span style="color:#ffad68">u</span><span>b</span><span style="color:#0092c8">a</span><span style="color:#ffad68">n</span>
</div>
<h1>登录跳转</h1>
<div><p>有异常请求从你的 IP 发出,请 <a href="https://accounts.douban.com/passport/login?redir=https%3A%2F%2Fmovie.douban.com%2Ftop250%3Fstart%3D0%26filter%3D">登录</a> 使用豆瓣</p></div>
</div>
</div>
</body>
</html>

====================================================================================================
{'http': '222.74.73.202:42055'}
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>豆瓣 - 登录跳转页</title>
<style type="text/css">
      body{font-family:Arial,Helvetica,sans-serif;font-size:14px;}
      h1{font-size:25px;margin:25px 0 10px 0;}
    </style>
</head>
<body>
<div>
<div style="margin:20px auto;">
<div style="font-size:25px;color:#1b9336;border-bottom:5px solid #eef9eb">
<span style="font-size:20px;font-weight:bold">豆瓣</span> d<span style="color:#0092c8">o</span><span style="color:#ffad68">u</span><span>b</span><span style="color:#0092c8">a</span><span style="color:#ffad68">n</span>
</div>
<h1>登录跳转</h1>
<div><p>有异常请求从你的 IP 发出,请 <a href="https://accounts.douban.com/passport/login?redir=https%3A%2F%2Fmovie.douban.com%2Ftop250%3Fstart%3D25%26filter%3D">登录</a> 使用豆瓣</p></div>
</div>
</div>
</body>
</html>

====================================================================================================
{'http': '59.63.74.63:8118'}
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>豆瓣 - 登录跳转页</title>
<style type="text/css">
      body{font-family:Arial,Helvetica,sans-serif;font-size:14px;}
      h1{font-size:25px;margin:25px 0 10px 0;}
    </style>
</head>
<body>
<div>
<div style="margin:20px auto;">
<div style="font-size:25px;color:#1b9336;border-bottom:5px solid #eef9eb">
<span style="font-size:20px;font-weight:bold">豆瓣</span> d<span style="color:#0092c8">o</span><span style="color:#ffad68">u</span><span>b</span><span style="color:#0092c8">a</span><span style="color:#ffad68">n</span>
</div>
<h1>登录跳转</h1>
<div><p>有异常请求从你的 IP 发出,请 <a href="https://accounts.douban.com/passport/login?redir=https%3A%2F%2Fmovie.douban.com%2Ftop250%3Fstart%3D50%26filter%3D">登录</a> 使用豆瓣</p></div>
</div>
</div>
</body>
</html>


I get the same result even without a proxy:
import requests, bs4

def getSoup():
   
    headers = {"User-Agent":"ozilla/5.0 (Windows NT 10.0; " \
               + "Win64; x64) AppleWebKit/537.36 (KHTML, like " \
               + "Gecko) Chrome/97.0.4692.71 Safari/537.36"}
   
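    # The nums list was lost from the post; these offsets are inferred
    # from the start= values in the printed URLs.
    hosts, nums, soups = [], list(range(0, 250, 25)), []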


    for i in nums:
      hosts.append("https://movie.douban.com/top250?start=" + str(i) +"&filter=")


    for i in hosts:

      res = requests.get(i, \
                     headers=headers)

      html = bs4.BeautifulSoup(res.text, "html.parser")
      soups.append(html)
      print(html)
      print('='*100)

    return soups



getSoup()

<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>豆瓣 - 登录跳转页</title>
<style type="text/css">
      body{font-family:Arial,Helvetica,sans-serif;font-size:14px;}
      h1{font-size:25px;margin:25px 0 10px 0;}
    </style>
</head>
<body>
<div>
<div style="margin:20px auto;">
<div style="font-size:25px;color:#1b9336;border-bottom:5px solid #eef9eb">
<span style="font-size:20px;font-weight:bold">豆瓣</span> d<span style="color:#0092c8">o</span><span style="color:#ffad68">u</span><span>b</span><span style="color:#0092c8">a</span><span style="color:#ffad68">n</span>
</div>
<h1>登录跳转</h1>
<div><p>有异常请求从你的 IP 发出,请 <a href="https://accounts.douban.com/passport/login?redir=https%3A%2F%2Fmovie.douban.com%2Ftop250%3Fstart%3D0%26filter%3D">登录</a> 使用豆瓣</p></div>
</div>
</div>
</body>
</html>

====================================================================================================
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<title>豆瓣 - 登录跳转页</title>
<style type="text/css">
      body{font-family:Arial,Helvetica,sans-serif;font-size:14px;}
      h1{font-size:25px;margin:25px 0 10px 0;}
    </style>
</head>
<body>
<div>
<div style="margin:20px auto;">
<div style="font-size:25px;color:#1b9336;border-bottom:5px solid #eef9eb">
<span style="font-size:20px;font-weight:bold">豆瓣</span> d<span style="color:#0092c8">o</span><span style="color:#ffad68">u</span><span>b</span><span style="color:#0092c8">a</span><span style="color:#ffad68">n</span>
</div>
<h1>登录跳转</h1>
<div><p>有异常请求从你的 IP 发出...

But the page loads normally in a browser. Is it a cookie problem?

Update: fixing the UA, adding a cookie, and adding time.sleep(1) solves it.
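
For anyone landing here later, the fix above might look roughly like the sketch below. It is only an illustration under assumptions: the helper names (`build_headers`, `page_urls`, `fetch_pages`) are made up for this example, the cookie string has to be copied from your own logged-in browser session, and the `get` parameter is there so that `requests.get` (or a stand-in with the same call shape) can be passed in.

```python
import time

BASE_URL = "https://movie.douban.com/top250?start={}&filter="


def build_headers(cookie):
    """Headers with the corrected User-Agent ("Mozilla", not "ozilla")
    plus a Cookie value copied from a browser session (assumption)."""
    return {
        "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                       "AppleWebKit/537.36 (KHTML, like Gecko) "
                       "Chrome/97.0.4692.71 Safari/537.36"),
        "Cookie": cookie,
    }


def page_urls(step=25, total=250):
    """URLs of the Top 250 pages: start=0, 25, ..., 225."""
    return [BASE_URL.format(start) for start in range(0, total, step)]


def fetch_pages(get, cookie, delay=1.0):
    """Fetch every page with `get` (e.g. requests.get), sleeping
    `delay` seconds after each request, and return the HTML texts."""
    headers = build_headers(cookie)
    pages = []
    for url in page_urls():
        res = get(url, headers=headers, timeout=10)
        pages.append(res.text)
        time.sleep(delay)  # the time.sleep(1) mentioned in the post
    return pages
```

With requests installed, this could be called as `fetch_pages(requests.get, cookie=my_cookie)`, where `my_cookie` is the Cookie header value taken from your own browser.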

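王尧 posted on 2022-1-15 22:45:22

They've already banned you, so just let it go, bro {:5_97:}

是阿佳! posted on 2022-1-15 22:48:35

王尧 posted on 2022-1-15 22:45
They've already banned you, so just let it go, bro

Are the IPs I found unreliable?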
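
One thing that may matter more than the quality of the IPs: requests picks a proxy by the URL's scheme, so a `proxies` dict with only an 'http' key is not applied to the https:// Douban URLs (unless an environment proxy happens to be set). That would be consistent with getting the same page with and without the proxy. A minimal sketch of a mapping that covers both schemes, assuming the pickled entries are plain 'host:port' strings like the ones printed above and that the proxy can tunnel HTTPS; `pick_proxy` is a hypothetical helper name:

```python
import random


def pick_proxy(ips):
    """Pick one address at random and return a proxies mapping that
    covers both http:// and https:// URLs."""
    addr = random.choice(ips)
    # Assumption: entries look like 'host:port'; add a scheme if missing.
    proxy_url = addr if "://" in addr else "http://" + addr
    return {"http": proxy_url, "https": proxy_url}
```

The result would be passed as `proxies=pick_proxy(ips)` in place of `{'http': random.choice(ips)}`.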

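王尧 posted on 2022-1-15 23:11:43

是阿佳! posted on 2022-1-15 22:48
Are the IPs I found unreliable?

Sorry, I don't know Python; I only do web development.

isdkz posted on 2022-1-16 07:16:38

It's probably a problem with your UA: "Mozilla" is missing the "M".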

100gram posted on 2022-1-16 10:55:06

{:10_245:}

myqf123 posted on 2022-1-16 11:07:12

No solution.

tianlai7266 posted on 2022-1-16 12:56:31

{:10_256:}

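AdiosSpike posted on 2022-1-16 13:11:02

Thanks.

foxiangzun posted on 2022-1-16 13:18:56

Don't worry. Douban seems to lift the ban after 48 hours. If you can still browse Douban normally in a browser, there's no real problem. I got banned once too, and it was lifted after 48 hours, so just be patient.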
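
Rather than guessing whether the ban has lifted, a script could check whether a response is the login-redirect page shown in the first post and stop early if so. A rough sketch, assuming the block page keeps the same title and message text as in that output; `looks_blocked` and `BLOCK_MARKERS` are hypothetical names:

```python
# Strings taken from the login-redirect page printed in the first post.
BLOCK_MARKERS = ("登录跳转页", "有异常请求从你的 IP 发出")


def looks_blocked(html):
    """Return True if the HTML text contains one of the block-page markers."""
    return any(marker in html for marker in BLOCK_MARKERS)
```

Calling `looks_blocked(res.text)` after each request would let the loop break on the first blocked response instead of sending the remaining requests.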

233073524 posted on 2022-1-16 13:36:10

You won't even add a single sleep, so of course your IP gets banned. How is this kind of access pattern any different from a DDoS?
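
To make the point concrete, here is one way to space requests out, with a little random jitter so they don't arrive at perfectly regular intervals. It is only a sketch: `throttled` is a made-up helper, and the default delays are arbitrary assumptions, not a known Douban threshold.

```python
import random
import time


def throttled(items, base=1.0, jitter=1.0, sleep=time.sleep):
    """Yield items one at a time, pausing between base and base + jitter
    seconds before every item except the first."""
    for index, item in enumerate(items):
        if index:
            sleep(base + random.uniform(0, jitter))
        yield item
```

The request loop would then read `for url in throttled(hosts): ...`, leaving the rest of the code unchanged.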

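沮授 posted on 2022-1-16 13:44:20

If you can still access the site, your IP probably isn't banned, right? {:10_243:}

darrenkwan posted on 2022-1-16 14:51:48

If you're banned, just let it go, bro.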

python爱好者. posted on 2022-1-17 09:18:52

.

JingClytze posted on 2022-1-17 11:41:57

I don't get it.

jiminli88 posted on 2022-1-17 20:39:01

Impressive, you're already writing web scrapers.

伽羅~ posted on 2022-1-24 10:46:18

{:10_254:}

qwb1997 posted on 2022-1-24 13:04:47

{:10_279:}

qwb1997 posted on 2022-1-24 14:17:47

{:10_279:}

swanseabrian posted on 2022-2-7 10:54:20

Where did you get the proxy IPs, boss? I'd like to scrape some data too.