Just reached the XXOO crawler lesson and ran into 403 Forbidden

夏夜夏月 · posted on 2015-6-11 21:20:04

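Below is my code, written by following 小甲鱼's tutorial. I have already added a header, so I don't know why Jandan still blocks me. Do I need to use a proxy?
How can I fix this?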

import urllib.request
import os

def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')
    response = urllib.request.urlopen(url)
    html = response.read()
    return html

def get_page(url):
    html = url_open(url).decode('utf-8')
    a = html.find('current-comment-page') + 23
    b = html.find(']', a)
    return html[a:b]

def find_imgs(url):
    html = url_open(url).decode('utf-8')
    img_addrs = []
    a = html.find('img src=')
    while a != -1:
        b = html.find('.jpg', a, a + 255)
        if b != -1:
            img_addrs.append(html[a+9:b+4])
        else:
            b = a + 9
        a = html.find('img src=', b)
    return img_addrs

def save_imgs(folder, img_addrs):
    for each in img_addrs:
        filename = each.split('/')[-1]
        with open(filename, 'wb') as f:
            img = url_open(each)
            f.write(img)

def download_mm(folder='OOXX', pages=10):
    os.mkdir(folder)
    os.chdir(folder)
    url = 'http://jandan.net/ooxx/'
    page_num = int(get_page(url))
    for i in range(pages):
        page_num -= i
        page_url = url + 'page-' + str(page_num) + '#comments'
        img_addrs = find_imgs(page_url)
        save_imgs(img_addrs)

if __name__ == '__main__':
    download_mm()

Here is the error message:
Traceback (most recent call last):
  File "C:/Python34/testpython/爬煎蛋的妹纸.py", line 60, in <module>
    download_mm()
  File "C:/Python34/testpython/爬煎蛋的妹纸.py", line 51, in download_mm
    page_num = int(get_page(url))
  File "C:/Python34/testpython/爬煎蛋的妹纸.py", line 13, in get_page
    html = url_open(url).decode('utf-8')
  File "C:/Python34/testpython/爬煎蛋的妹纸.py", line 7, in url_open
    response = urllib.request.urlopen(url)
  File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python34\lib\urllib\request.py", line 469, in open
    response = meth(req, response)
  File "C:\Python34\lib\urllib\request.py", line 579, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python34\lib\urllib\request.py", line 507, in error
    return self._call_chain(*args)
  File "C:\Python34\lib\urllib\request.py", line 441, in _call_chain
    result = func(*args)
  File "C:\Python34\lib\urllib\request.py", line 587, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

夏夜夏月 · posted on 2015-6-13 13:07:23

:cry: Bumping my own thread.

kmaster · posted on 2015-6-14 16:44:51

response = urllib.request.urlopen(req)
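
To spell out the one-line answer above: the original url_open builds a Request that carries the User-Agent header, but then passes the bare url string to urlopen, so the header is never sent and the request goes out with urllib's default Python-urllib User-Agent. A minimal sketch of the corrected function follows. The build_request helper and the USER_AGENT constant are names introduced here for illustration; they are not in the original post.

```python
import urllib.request

# Same User-Agent string as in the original post.
USER_AGENT = ('Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36')

def build_request(url):
    """Return a Request that carries the browser-like User-Agent."""
    req = urllib.request.Request(url)
    req.add_header('User-Agent', USER_AGENT)
    return req

def url_open(url):
    """Fetch url and return the raw response bytes."""
    req = build_request(url)
    # Pass the Request object, not the bare url string; otherwise the
    # header added above is never sent.
    with urllib.request.urlopen(req) as response:
        return response.read()
```

Whether the site accepts the request after this change is a separate question, as later replies in this thread show. Also note that download_mm calls save_imgs(img_addrs) with one argument while save_imgs is defined as save_imgs(folder, img_addrs), which would raise a TypeError once the 403 is out of the way.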

邻家老王 · posted on 2015-8-1 21:52:59

Bookmarking this. Jandan has been crawled far too much.

戴宇轩 · posted on 2015-8-2 20:22:45

小甲鱼's course is so popular that everyone came to crawl Jandan, so Jandan blocked crawlers…

半杯茶不要抢 · posted on 2015-8-24 16:45:42

data=None
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'}

req=urllib.request.Request(url_meizi,data,headers)

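小甲鱼 · posted on 2015-8-25 01:18:21

戴宇轩 posted on 2015-8-2 20:22
小甲鱼's course is so popular that everyone came to crawl Jandan, so Jandan blocked crawlers…

Yes… Jandan later blocked crawlers…

Everyone had better pick a different site to practice on. Servers generally do not welcome "visits" from crawlers~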

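jiagd0105 · posted on 2015-8-25 10:52:22

小甲鱼 posted on 2015-8-25 01:18
Yes… Jandan later blocked crawlers…

Everyone had better pick a different site to practice on. Servers generally do not welcome "visits" from crawlers~

I ran into the same problem; I can no longer crawl the girl pictures...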

vinking93 · posted on 2015-8-26 11:01:54

How did you solve it? I am hitting the same problem.

249018563 · posted on 2015-8-26 11:17:15

import urllib.request

data=None
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36'}
req=urllib.request.Request('http://jandan.net/ooxx',data,headers)
response = urllib.request.urlopen(req)
html = response.read().decode('utf-8')

a = html.find('current-comment-page') + 23
b = html.find(']',a)
print(html[a:b])
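
A note on the `+ 23` offset in the snippet above: it skips the 20 characters of `current-comment-page` plus three more characters of markup, which fits a page containing something like `current-comment-page">[1234]`. That fragment is an assumption about Jandan's HTML at the time, inferred only from the offsets used in the post. Under that assumption, a regular expression states the intent more directly and does not break silently if the marker is missing. The names get_page_number, get_page_number_by_offset and SAMPLE_HTML are hypothetical.

```python
import re

# Assumed shape of the markup, inferred from the find() offsets in the post.
SAMPLE_HTML = '<span class="current-comment-page">[1234]</span>'

def get_page_number(html):
    """Return the current comment page as an int, or None if not found."""
    match = re.search(r'current-comment-page[^\[]*\[(\d+)\]', html)
    return int(match.group(1)) if match else None

def get_page_number_by_offset(html):
    """The find()-based approach from the post, kept for comparison."""
    a = html.find('current-comment-page') + 23
    b = html.find(']', a)
    return int(html[a:b])
```

With the find()-based version, a page without the marker makes find() return -1, so the slice starts at offset 22 of whatever HTML came back and int() fails on unrelated text; the regex version returns None instead.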

249018563 · posted on 2015-8-26 11:20:43

This one gets through.

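大大张宇 · posted on 2015-9-15 21:17:15

I solved it with the method in post #10; it works well.
However, the images I saved will not open; the viewer says the file may be corrupted.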
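
The thread does not say why the saved images failed to open. One plausible cause, stated here as an assumption, is that the server answered an image request with something other than image data (an HTML error page, for example), which save_imgs would then write to disk under a .jpg name. A quick check of the leading bytes can catch that case, since JPEG files start with the bytes FF D8 FF. The helpers looks_like_jpeg and save_if_jpeg are hypothetical names for this sketch.

```python
# Every JPEG file begins with these three bytes.
JPEG_MAGIC = b'\xff\xd8\xff'

def looks_like_jpeg(data):
    """Return True if the downloaded bytes start with the JPEG signature."""
    return data[:3] == JPEG_MAGIC

def save_if_jpeg(filename, data):
    """Write data to filename only if it looks like a JPEG.

    Returns True if the file was written, False if the data was rejected.
    """
    if not looks_like_jpeg(data):
        return False
    with open(filename, 'wb') as f:
        f.write(data)
    return True
```

If the check fails, printing the first few hundred bytes of the response usually shows what the server actually sent.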

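xcxyxxjsx · posted on 2015-12-22 09:46:29

小甲鱼 posted on 2015-8-25 01:18
Yes… Jandan later blocked crawlers…

Everyone had better pick a different site to practice on. Servers generally do not welcome "visits" from crawlers~

Is there any way around this? Most sites nowadays dislike crawlers, especially video and music sites, yet those are exactly what people most want to crawl. Is there a solution, master?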

zxszx4 · posted on 2015-12-22 23:27:33

249018563 posted on 2015-8-26 11:17
import urllib.request

data=None

Expert, why do I get an error when I run this? Could you take a look?

import urllib.request
url='http://jandan.net/ooxx'
data=None
headers={
    'Host': 'jandan.net',
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:42.0) Gecko/20100101 Firefox/42.0',
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate',
    'X-Requested-With': 'XMLHttpRequest',
    'Referer': 'http://jandan.net/ooxx',
    'Cookie': '4022519986=7; _ga=GA1.2.43294716.1450792083; Hm_lvt_fd93b7fb546adcfbcf80c4fc2b54da2c=1450792084; Hm_lpvt_fd93b7fb546adcfbcf80c4fc2b54da2c=1450794167',
    'x-forwarded-for': '114.114.114.114',
    'Connection': 'keep-alive'
}
req = urllib.request.Request(url,data,headers)
response = urllib.request.urlopen(req)
html = response.read().decode("utf-8")
a=html.find('current-comment-page')+23
b=html.find(']',a)
print(html)

Error message

zxszx4 · posted on 2015-12-23 10:39:08

Uh, I figured out why: it was because I added 'Accept-Encoding': 'gzip, deflate' to the headers. Sigh, words cannot express my sorrow.
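
The likely mechanism, stated as an assumption because the error message itself is not preserved in this thread: sending Accept-Encoding: gzip, deflate allows the server to reply with a compressed body, and urllib.request does not decompress responses automatically, so response.read().decode("utf-8") ends up being applied to compressed bytes. The simplest fix is the one found above, dropping the header. If the header is kept, the body has to be decompressed first. A minimal sketch; decode_body is a hypothetical helper and handles only gzip, not deflate.

```python
import gzip

# gzip streams always begin with these two bytes.
GZIP_MAGIC = b'\x1f\x8b'

def decode_body(raw, content_encoding=''):
    """Decode a response body to text, un-gzipping it first if needed.

    raw is what response.read() returned; content_encoding is the value of
    response.headers.get('Content-Encoding', '').
    """
    if 'gzip' in content_encoding.lower() or raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode('utf-8')
```

In the script above this would replace the decode step, roughly as html = decode_body(response.read(), response.headers.get('Content-Encoding', '')).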

1137668129 · posted on 2016-2-17 23:02:19

A little sad; I was hoping to download some ooxx pictures.

worry921 · posted on 2016-2-18 08:34:53

Impressive. How many lessons in do you have to be before you can crawl Jandan?

wanglong12341 · posted on 2016-2-18 15:29:46

Notice: the author has been banned or deleted; the content is hidden automatically.

westel · posted on 2016-3-7 11:19:36

Thanks, zxszx4. Jandan can be crawled again.

RUCJack · posted on 2016-3-22 15:27:37

I tried the method in post #10 and it still did not work. Using a proxy server does work; I tested it myself:
proxy_support = urllib.request.ProxyHandler({'http':'222.220.113.4:3128'})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36')]
response = opener.open(url)
Tested on 2016-03-22.
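
The proxy snippet above leaves url undefined and does not read the response. A self-contained sketch of the same approach follows. The function names build_proxy_opener and fetch_via_proxy and the timeout value are assumptions added here; the proxy address in the post was reported working on 2016-03-22 and should not be assumed to still work, so substitute one you are allowed to use.

```python
import urllib.request

# Same User-Agent string as in the post above.
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36')

def build_proxy_opener(proxy_address):
    """Return an opener that sends HTTP requests through proxy_address."""
    proxy_support = urllib.request.ProxyHandler({'http': proxy_address})
    opener = urllib.request.build_opener(proxy_support)
    # addheaders applies to every request made through this opener.
    opener.addheaders = [('User-Agent', USER_AGENT)]
    return opener

def fetch_via_proxy(url, proxy_address, timeout=10):
    """Fetch url through the proxy and return the raw response bytes."""
    opener = build_proxy_opener(proxy_address)
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

For example, html = fetch_via_proxy('http://jandan.net/ooxx', '222.220.113.4:3128').decode('utf-8') mirrors the post, with the caveat about the proxy address noted above.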
