A question about crawler proxies, Python Discussion, Programming Languages, FishC Forum (鱼C论坛)

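兰竹皋 posted on 2020-11-14 19:02:17

A question about crawler proxies

Last edited by 兰竹皋 on 2020-11-14 22:25

Hello everyone. I ran into a problem while practicing web scraping over the last few days and hope someone can clear it up ^_^ Thanks.

When requests is given a proxy IP, my computer reports this error:
requests.exceptions.ProxyError: HTTPConnectionPool(host='220.184.38.180', port=9000): Max retries exceeded with url: http://httpbin.org/ip (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x00000294FE623520>: Failed to establish a new connection: 由于目标计算机积极拒绝，无法连接。')))
(The Chinese text at the end is the Windows message for a refused connection: "No connection could be made because the target machine actively refused it.")
Why does this happen?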

1q23w31 posted on 2020-11-15 07:12:07

The proxy is unavailable. Try a different one.

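suchocolate posted on 2020-11-15 09:13:51

Post your code.
HTTP runs on top of TCP/IP, so "carrying an IP" is not the right way to put it: packets have an IP header even without a proxy. Setting a proxy only changes where they are sent: the destination becomes the proxy server's IP and port instead of the target server's IP and port 80.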
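
To illustrate the point about the destination changing, here is a minimal sketch (standard library only) of which host and port the TCP connection is opened to, with and without a proxy. The helper name connection_target and the DEFAULT_PORTS table are assumptions made for illustration; this is not how requests is implemented internally.

```python
from urllib.parse import urlsplit

# Well-known default ports for the two schemes discussed in this thread.
DEFAULT_PORTS = {"http": 80, "https": 443}


def connection_target(url, proxies=None):
    """Return the (host, port) a client would open its TCP connection to.

    Hypothetical helper, for illustration only.
    """
    target = urlsplit(url)
    proxy = (proxies or {}).get(target.scheme)
    if proxy:
        # Accept both "ip:port" and "http://ip:port".
        if "://" not in proxy:
            proxy = "http://" + proxy
        target = urlsplit(proxy)
    return target.hostname, target.port or DEFAULT_PORTS[target.scheme]


# Without a proxy the connection goes to the target server on port 80.
print(connection_target("http://httpbin.org/ip"))
# With a proxy configured for "http", it goes to the proxy's IP and port.
print(connection_target("http://httpbin.org/ip", {"http": "220.184.38.180:9000"}))
```

If the proxy at that address refuses the connection, the request fails before the target server is ever contacted, which is consistent with the ProxyError quoted in the first post.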

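兰竹皋 posted on 2020-11-15 10:31:33

suchocolate posted on 2020-11-15 09:13
Post your code.
HTTP runs on top of TCP/IP, so "carrying an IP" is not the right way to put it: packets have an IP header even without a proxy. Setting a proxy only ...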

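import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
}
# Proxy address taken from a free proxy list; it may not be reachable.
proxy = {
    'http': 'http://58.253.157.188:9999'
}

# A timeout keeps the request from hanging on a dead proxy.
response = requests.get("http://httpbin.org/ip", proxies=proxy, headers=headers, timeout=10)

print(response.text)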
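It is just the simplest possible example.
I found the free IPs online, on goubanjia, 原代理 and similar sites. Is the error simply because none of those IPs work?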

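suchocolate posted on 2020-11-15 10:48:15

兰竹皋 posted on 2020-11-15 10:31
It is just the simplest possible example.
I found the free IPs online, on goubanjia, 原代理 and similar sites. Is the error simply because none of those IPs work?

I captured the packets: this proxy server does not respond at all, so it definitely cannot be used. Find a proxy that works.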
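
A packet capture is not the only way to see this. As a rough sketch, a plain TCP connection attempt to the proxy's IP and port already shows whether anything is listening there. The function name proxy_port_open and the 3-second default timeout are assumptions for illustration, and a successful connection only proves the port is open, not that the service behind it is a working proxy.

```python
import socket


def proxy_port_open(host, port, timeout=3.0):
    """Try a TCP connection to host:port.

    Returns (True, "") on success, or (False, reason) where reason is the
    name of the exception raised (refused, timed out, unreachable, ...).
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as exc:
        # ConnectionRefusedError and socket timeouts are both OSError subclasses.
        return False, type(exc).__name__
```

For example, proxy_port_open('58.253.157.188', 9999) can be called before handing that address to requests; addresses that fail this check can be skipped without waiting for a full HTTP request to time out.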

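兰竹皋 posted on 2020-11-15 13:48:47

suchocolate posted on 2020-11-15 10:48
I captured the packets: this proxy server does not respond at all, so it definitely cannot be used. Find a proxy that works.

Thanks. So does that mean almost no free IPs are usable?
May I ask where you usually get your proxy IPs?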

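suchocolate posted on 2020-11-15 13:55:08

兰竹皋 posted on 2020-11-15 13:48
Thanks. So does that mean almost no free IPs are usable?
May I ask where you usually get your proxy IPs?

I don't usually use them. Most seem to offer only a short free trial, and the stable ones all charge, something like 30 yuan a month.
Find a proxy provider and ask their customer service for a trial. For learning purposes a trial is enough.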

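兰竹皋 posted on 2020-11-15 14:02:12

suchocolate posted on 2020-11-15 13:55
I don't usually use them. Most seem to offer only a short free trial, and the stable ones all charge, something like 30 yuan a month.
Find a proxy provider and ask their customer ...

OK, thank you (^_^)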

兰竹皋 posted on 2020-11-15 14:03:02

1q23w31 posted on 2020-11-15 07:12
The proxy is unavailable. Try a different one.

Thanks

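兰竹皋 posted on 2020-11-15 15:19:45

suchocolate posted on 2020-11-15 13:55
I don't usually use them. Most seem to offer only a short free trial, and the stable ones all charge, something like 30 yuan a month.
Find a proxy provider and ask their customer ...

Hi, are you still there? Could you try this code for me?
I scraped several hundred proxy IPs from www.goubanjia.com in a row and not one of them works. Very strange!!! It makes me suspect the problem is my own computer.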

from multiprocessing.pool import ThreadPool
import time

import requests
from lxml import etree

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
}
# Pages that echo back the IP address they see.
url_https = 'https://202020.ip138.com/'
url_http = 'http://httpbin.org/ip'


def open_url(url):
    """Download a page and return its text."""
    res = requests.get(url=url, headers=HEADERS, timeout=10)
    res.raise_for_status()
    res.encoding = 'utf-8'
    return res.text


def open_url_http(address):
    """Check an HTTP proxy given as an 'ip:port' string."""
    try:
        # The keyword is `proxies`; passing `proxy=` raises TypeError.
        res = requests.get(url=url_http, headers=HEADERS,
                           proxies={'http': 'http://' + address}, timeout=10)
        assert res.status_code == 200
        con = res.json()['origin']
        # httpbin reports the caller's IP; compare it with the proxy's IP part.
        if address.split(':')[0] in con:
            print('http://' + address + ' is usable')
        else:
            print('http://' + address + ' error: origin is ' + con)
    except Exception as e:
        # Report failures instead of silently discarding them.
        print('http://' + address + ' failed: ' + type(e).__name__)


def open_url_https(address):
    """Check an HTTPS-capable proxy given as an 'ip:port' string."""
    try:
        # Assumption: the proxy is reached over plain HTTP and tunnels HTTPS.
        res = requests.get(url=url_https, headers=HEADERS,
                           proxies={'https': 'http://' + address}, timeout=10)
        assert res.status_code == 200
        tree = etree.HTML(res.text)
        # xpath() returns a list of strings, so join it before searching.
        con = ''.join(tree.xpath('//body/p//text()'))
        if address.split(':')[0] in con:
            print('https://' + address + ' is usable')
        else:
            print('https://' + address + ' error')
    except Exception as e:
        print('https://' + address + ' failed: ' + type(e).__name__)


def get_proxies():
    """Scrape the proxy table and return (http, https) lists of 'ip:port' strings."""
    proxies_https = []
    proxies_http = []
    url = 'http://www.goubanjia.com/'
    re_text = open_url(url=url)

    tree = etree.HTML(re_text)
    info_list = tree.xpath('//tbody/tr')
    for each in info_list:
        info_dict = {}

        # Assumption: columns 1-3 hold address, anonymity and protocol;
        # the XPath expressions as posted had no column index.
        each_address_list = []
        for each_address in each.xpath('./td[1]//*'):
            # Skip decoy elements hidden with "display: none".
            if each_address.xpath('./@style') in (["display: none;"], ["display:none;"]):
                continue
            each_address_list.extend(each_address.xpath('./text()'))
        if len(each_address_list) < 2:
            continue
        # The last visible fragment is the port; the rest form the IP.
        info_dict['address'] = ''.join(each_address_list[:-1]) + ':' + each_address_list[-1]

        info_dict['anonymity'] = ''.join(each.xpath('./td[2]//text()')).strip()
        info_dict['protocal'] = ''.join(each.xpath('./td[3]//text()')).strip().lower()
        if info_dict['protocal'] == 'http':
            if info_dict['address'] not in proxies_http:
                proxies_http.append(str(info_dict['address']))
        elif info_dict['protocal'] == 'https':
            if info_dict['address'] not in proxies_https:
                proxies_https.append(str(info_dict['address']))

    print(proxies_http)
    print(proxies_https)
    return proxies_http, proxies_https


def main():
    proxies_http, proxies_https = get_proxies()
    pool = ThreadPool(5)
    print('main start')
    # Pass the plain 'ip:port' string; each checker builds its own proxies dict.
    for each_proxy in proxies_http:
        pool.apply_async(open_url_http, (each_proxy,))
    for each_proxy in proxies_https:
        pool.apply_async(open_url_https, (each_proxy,))
    pool.close()
    pool.join()
    print('main over')


if __name__ == '__main__':
    for i in range(5):
        main()
        time.sleep(0.5)
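
One thing that may be worth checking in a script like this: work submitted with apply_async only surfaces an exception when .get() is called on the returned result object, so a bug inside a worker can look exactly like "no proxy works". Below is a rough sketch of collecting an explicit outcome per address instead. The names check_all and fake_check are hypothetical, and fake_check only stands in for a real network check so that the example runs offline.

```python
from multiprocessing.pool import ThreadPool


def check_all(addresses, check, workers=5):
    """Run check(address) for every address and return {address: outcome}.

    `check` is a caller-supplied function that returns True or False,
    or raises an exception when the check itself fails.
    """
    def safe_check(address):
        try:
            outcome = "ok" if check(address) else "rejected"
        except Exception as exc:
            # Keep the exception type so that failures stay visible.
            outcome = "failed: " + type(exc).__name__
        return address, outcome

    with ThreadPool(workers) as pool:
        # map() blocks until every check has finished and keeps input order.
        return dict(pool.map(safe_check, addresses))


def fake_check(address):
    """Stand-in checker for the demo; it does no network access."""
    if address.endswith(":0"):
        raise ConnectionError("refused")
    return address.startswith("10.")


if __name__ == "__main__":
    results = check_all(["10.0.0.1:8080", "192.0.2.1:80", "192.0.2.2:0"], fake_check)
    for address, outcome in results.items():
        print(address, outcome)
```

With a report like this, a TypeError repeated for every address points at the script, while refused connections and timeouts point at the proxies.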

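suchocolate posted on 2020-11-15 18:37:33

Last edited by suchocolate on 2020-11-15 18:43

兰竹皋 posted on 2020-11-15 15:19
Hi, are you still there? Could you try this code for me?
I scraped several hundred proxy IPs from www.goubanjia.com in a row and not one of them works. Very ...

You have to contact their customer service and ask for a free trial account. The site already states that the proxies it lists are not guaranteed to work.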
