python 055 Web Crawlers: Disguise and Proxies

风不会停息 · posted on 2018-7-31 01:11:29


Last edited by 风不会停息 on 2018-9-2 21:21

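1. A server inspects the User-Agent field in the HTTP request headers to tell browsers apart from non-browser clients, and also uses it to distinguish one browser from another.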
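To see why a script is easy to tell apart from a browser, it helps to look at the User-Agent that urllib attaches when none is set. The following is a minimal sketch using only the standard library; it sends no request, and the helper name default_user_agent is made up for illustration.

```python
import urllib.request

def default_user_agent():
    """Return the User-Agent value a freshly built opener would send."""
    opener = urllib.request.build_opener()
    # addheaders is a list of (name, value) tuples added to every request
    headers = dict(opener.addheaders)
    return headers.get('User-agent')

# Prints a value of the form Python-urllib/<version>, not a browser string
print(default_user_agent())
```

A server that filters on User-Agent can recognize this default value at a glance, which is the reason for overriding it.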

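2. When writing a crawler in Python, you can set the User-Agent so that requests look like they come from a browser when needed. There are two ways to do this:
1. urllib.request.Request(url, data, headers): pass the headers argument to set the User-Agent. headers is a dict and can be set up like this:

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'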


2. Use the add_header() method to add headers to an existing Request object, for example:

req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
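The two methods should produce the same request header. The sketch below checks that without sending anything over the network; the URL is a placeholder assumed for illustration. Note that Request stores header names in capitalized form, so the lookup key is 'User-agent'.

```python
import urllib.request

UA = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
url = 'http://example.com/'  # placeholder URL, assumed for illustration

# Method 1: pass a headers dict when constructing the Request
req1 = urllib.request.Request(url, headers={'User-Agent': UA})

# Method 2: add the header to an existing Request
req2 = urllib.request.Request(url)
req2.add_header('User-Agent', UA)

# Both requests now carry the same User-Agent value
print(req1.get_header('User-agent') == req2.get_header('User-agent'))  # prints True
```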


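3. Proxies: accessing a server through a proxy from Python takes the following three steps:
1. Create a proxy handler, ProxyHandler:
proxy_support = urllib.request.ProxyHandler(). ProxyHandler is a class whose argument is a dict: {'type': 'proxy ip:port'}, where the type is the URL scheme, such as 'http' or 'https'.
What is a handler? Each handler knows how to open URLs for a particular protocol, or how to deal with one aspect of opening a URL, such as HTTP redirects or HTTP cookies.

2. Customize and create an opener:
opener = urllib.request.build_opener(proxy_support)
What is an opener? Python uses an opener whenever it opens a URL. urllib.request.urlopen() itself uses the default opener; here we build a custom one so that we can specify the handler.

3a. Install the opener:
urllib.request.install_opener(opener)
install_opener sets the (global) default opener, which means later calls to urlopen will use the opener you installed.

3b. Call the opener:
opener.open(url)
This method can be used to fetch URLs directly, just like urlopen. Calling install_opener is usually unnecessary and is mainly a convenience.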
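The steps above, taking route 3b instead of 3a, can be sketched as follows. This is only an illustration: the helper name make_proxy_opener and the address 127.0.0.1:8080 are assumptions, and no request is actually sent.

```python
import urllib.request

def make_proxy_opener(proxy_addr):
    """Build an opener that routes http and https requests via proxy_addr."""
    # Step 1: a proxy handler, keyed by URL scheme
    proxy_support = urllib.request.ProxyHandler(
        {'http': proxy_addr, 'https': proxy_addr})
    # Step 2: a custom opener built around that handler
    return urllib.request.build_opener(proxy_support)

opener = make_proxy_opener('127.0.0.1:8080')  # placeholder proxy address

# Step 3b would be opener.open(url). Because install_opener() is never
# called here, plain urlopen() keeps using the default opener.
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Keeping the opener local like this avoids changing the behavior of every other urlopen() call in the program.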

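import urllib.request
import random

url = "https://www.ip.cn/"
iplist = ['121.43.170.207:3128']
proxy = random.choice(iplist)
# ProxyHandler looks proxies up by URL scheme. The url above is https, so an
# 'https' entry is needed as well (this assumes the proxy accepts https traffic).
proxy_support = urllib.request.ProxyHandler({'http' : proxy, 'https' : proxy})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')] # disguise the request as coming from a browser
urllib.request.install_opener(opener)   # make this opener the global default
response = urllib.request.urlopen(url)  # urlopen now goes through the proxy
html = response.read().decode('utf-8')
print(html)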
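If a run of this kind still reports your own IP, one thing worth checking is whether the proxy mapping covers the scheme of the URL being fetched. ProxyHandler looks proxies up by scheme, so a mapping with only an 'http' key does not apply to an https:// URL. The sketch below performs just that check; the helper name proxy_covers is hypothetical and no request is sent. A proxy that is down or that forwards the client address can cause similar symptoms, which this check does not detect.

```python
from urllib.parse import urlsplit

def proxy_covers(url, proxies):
    """Return True if the proxies mapping has an entry for the URL's scheme."""
    return urlsplit(url).scheme in proxies

proxies = {'http': '121.43.170.207:3128'}

print(proxy_covers('https://www.ip.cn/', proxies))  # False: no 'https' key
print(proxy_covers('http://www.ip.cn/', proxies))   # True
```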


4. Beautiful Soup 4 module documentation (Chinese): https://www.crummy.com/software/ ... c.zh/index.html#id9

5. Regular expressions, introduction and simple usage:
1. Introduction: http://fishc.com.cn/forum.php?mo ... peid%26typeid%3D403
2. Simple usage: http://fishc.com.cn/forum.php?mo ... peid%26typeid%3D403

Hands-on exercise 1 code:

import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup

def main():
    keyword = input("Enter a keyword: ")
    string_parameters = {'word' : keyword}
    # URL-encode the keyword so non-ASCII characters are safe in the query string
    keyword = urllib.parse.urlencode(string_parameters)
    url = "http://baike.baidu.com/search/word?%s" % keyword
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    # Visit every link whose href contains 'item' (links to other entries)
    for each in soup.find_all( href = re.compile('item') ):
        content = "".join([each.text])
        url2 = "".join(["http://baike.baidu.com", each['href']])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        # Append the subtitle of the linked entry, if it has one
        if soup2.h2:
            content = "".join([content, soup2.h2.text])
        content = "".join([content, ' -> ', url2])
        print(content)

if __name__ == "__main__":
    main()
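The search URL in this exercise is built with urllib.parse.urlencode, which percent-encodes the keyword as UTF-8. A small sketch of just that step, with no network access; the helper name build_search_url and the keyword 猪八戒 are chosen only for illustration.

```python
import urllib.parse

def build_search_url(keyword):
    """Build the Baidu Baike search URL used in the exercise for a keyword."""
    # urlencode turns the dict into 'word=...' with UTF-8 percent-encoding
    query = urllib.parse.urlencode({'word': keyword})
    return "http://baike.baidu.com/search/word?%s" % query

print(build_search_url('猪八戒'))
# prints http://baike.baidu.com/search/word?word=%E7%8C%AA%E5%85%AB%E6%88%92
```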


Hands-on exercise 2 code:

import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup

def test_url(soup):
    # This is the text Baidu Baike shows when it has no entry for the term
    result = soup.find(text = re.compile("百度百科尚未收录词条"))
    if result:
        print(result[0 : -1])
        return False
    else:
        return True

def summary(soup):
    # Print the entry title, its subtitle if any, and the summary paragraph
    word = soup.h1.text
    if soup.h2:
        word += soup.h2.text
    print(word)
    if soup.find(class_ = "lemma-summary"):
        print(soup.find(class_ = "lemma-summary").text)

def get_urls(soup):
    # Generator: yield one "title -> url" line per related link
    for each in soup.find_all( href = re.compile('item') ):
        content = "".join([each.text])
        url2 = "".join(["http://baike.baidu.com", each['href']])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = "".join([content, soup2.h2.text])
        content = "".join([content, ' -> ', url2])
        yield content

def main():
    keyword = input("Enter a keyword: ")
    string_parameters = {'word' : keyword}
    keyword = urllib.parse.urlencode(string_parameters)
    url = "http://baike.baidu.com/search/word?%s" % keyword
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    if test_url(soup):
        summary(soup)
        command = input("Print related links (Y / N)? ")
        if command == 'N':
            print("Program finished!")
        elif command == 'Y':
            print("Printing related links: ")
            each = get_urls(soup)
            while True:
                # Print the next 10 links; stop when the generator runs out
                try:
                    for i in range(10):
                        print(next(each))
                except StopIteration:
                    break
                command = input("Enter any character to keep printing, a to print everything that remains (ctrl + c to quit), or q to quit: ")
                if command == 'q':
                    break
                elif command == 'a':
                    try:
                        for i in each:
                            print(i)
                    except KeyboardInterrupt:
                        print("Program exited!")
                        break
                else:
                    continue

if __name__ == "__main__":
    main()
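The "print 10 at a time" loop in this exercise pulls items from a generator with next() and catches StopIteration. One alternative, shown here only as a sketch, is itertools.islice, which takes up to n items and simply returns fewer when the generator runs out. The names take and fake_links are made up; fake_links stands in for get_urls(soup) so the example runs without network access.

```python
from itertools import islice

def take(gen, n):
    """Return up to n items from an iterator as a list (empty when exhausted)."""
    return list(islice(gen, n))

def fake_links():
    # Stand-in for get_urls(soup): yields 25 made-up entries
    for i in range(25):
        yield "entry %d" % i

links = fake_links()
while True:
    batch = take(links, 10)
    if not batch:        # generator exhausted, nothing left to print
        break
    for line in batch:
        print(line)
    print("--- end of batch (%d items) ---" % len(batch))
```

With 25 entries this produces batches of 10, 10 and 5; in the exercise, the prompt for the next command would go where the batch separator is printed.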


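昼木2333 · posted on 2020-2-21 18:40:53

OP, I ran your proxy IP code and it still returns my own IP. What is going on?

donaldl8 · posted on 2020-3-6 17:06:45

昼木2333 posted on 2020-2-21 18:40
OP, I ran your proxy IP code and it still returns my own IP. What is going on?

Same here, what is the problem? I followed 小甲鱼's video and I still get my own IP.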
