Last edited by 风不会停息 on 2018-9-2 21:21
1. Servers tell browsers apart from non-browser clients by the User-Agent field in the HTTP headers; servers also use the User-Agent to distinguish between individual browsers.
2. When writing a crawler in Python, you can set the User-Agent to masquerade as a browser when needed. There are two ways:
1. urllib.request.Request(url, data, headers): pass the headers parameter to set the User-Agent; headers is a dict and can be set like:
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
2. Use the add_header() method to add headers to the Request object, for example:
req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
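The two approaches above are equivalent; here is a minimal self-contained sketch (example.com is a placeholder URL, and no request is actually sent):

```python
import urllib.request

ua = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')

# Method 1: pass a headers dict to the Request constructor
req1 = urllib.request.Request('http://example.com', headers={'User-Agent': ua})

# Method 2: create the Request first, then call add_header()
req2 = urllib.request.Request('http://example.com')
req2.add_header('User-Agent', ua)

# urllib normalizes header names internally ('User-Agent' -> 'User-agent')
print(req1.get_header('User-agent') == req2.get_header('User-agent'))  # True
```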
3. Proxies: accessing a server through a proxy in Python takes the following 3 steps:
1. Create a proxy handler, ProxyHandler:
proxy_support = urllib.request.ProxyHandler() — ProxyHandler is a class whose argument is a dict of the form {'type': 'proxy ip:port'}
What is a handler? Handlers are also called processors; each handler knows how to open URLs via a particular protocol, or how to handle some aspect of opening URLs, such as HTTP redirects or HTTP cookies.
2. Build a customized opener:
opener = urllib.request.build_opener(proxy_support)
What is an opener? Python uses an opener whenever it opens a URL. In fact, urllib.request.urlopen() simply uses the default opener; here we build our own so that we can plug in our handler.
3a. Install the opener
urllib.request.install_opener(opener)
install_opener sets the (global) default opener, meaning later calls to urlopen will use the opener you installed.
3b. Call the opener directly
opener.open(url)
This method fetches URLs just like the urlopen function; you normally don't need to call install_opener, except for convenience.
import urllib.request
import random
url = "https://www.ip.cn/"
iplist = ['121.43.170.207:3128']
proxy = random.choice(iplist)
proxy_support = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})  # the URL above is https, so register the proxy for both schemes
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')] # masquerade as a browser
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)
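For completeness, step 3b (calling the opener directly, without install_opener) can be sketched like this, reusing the example proxy from the notes above — which may well be dead by now, so the actual fetch is left commented out:

```python
import urllib.request
import random

iplist = ['121.43.170.207:3128']  # example proxy from the notes; may be dead
proxy = random.choice(iplist)
# Route both plain and TLS traffic through the proxy
proxy_support = urllib.request.ProxyHandler({'http': proxy, 'https': proxy})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')]

# opener.open() behaves like urlopen(), but only requests made through this
# opener use the proxy; the global default opener is left untouched.
# html = opener.open('https://www.ip.cn/', timeout=10).read().decode('utf-8')
```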
4. Beautiful Soup 4 module Chinese documentation: https://www.crummy.com/software/ ... c.zh/index.html#id9
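The exercises below rely on only a few Beautiful Soup calls (find, find_all, .text, attribute access); here is a minimal offline sketch using made-up HTML:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Heading</h1>
  <a href="/item/Python">Python</a>
  <a href="/help/faq">FAQ</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
print(soup.h1.text)                              # Heading
# find_all('a') returns every <a> tag; tags support dict-style attribute access
links = [a['href'] for a in soup.find_all('a')]
print(links)                                     # ['/item/Python', '/help/faq']
```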
5. Regular expressions, an introduction and simple applications:
1. Introduction: http://fishc.com.cn/forum.php?mo ... peid%26typeid%3D403
2. Simple applications: http://fishc.com.cn/forum.php?mo ... peid%26typeid%3D403
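The one regex trick both exercises use is passing a compiled pattern wherever a filter is needed; a tiny sketch (the href values here are made up):

```python
import re

pattern = re.compile('item')  # search() matches anywhere in the string

hrefs = ['/item/Python', '/help/faq', '/item/%E7%88%AC%E8%99%AB']
matched = [h for h in hrefs if pattern.search(h)]
print(matched)  # ['/item/Python', '/item/%E7%88%AC%E8%99%AB']
```

This is exactly what `soup.find_all(href=re.compile('item'))` does in the exercise code: Beautiful Soup keeps only tags whose href attribute matches the pattern.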
Exercise 1 code:
import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup
def main():
    keyword = input("Enter a keyword: ")
    string_parameters = {'word': keyword}
    keyword = urllib.parse.urlencode(string_parameters)
    url = "http://baike.baidu.com/search/word?%s" % keyword
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    for each in soup.find_all(href=re.compile('item')):
        content = "".join([each.text])
        url2 = "".join(["http://baike.baidu.com", each['href']])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = "".join([content, soup2.h2.text])
        content = "".join([content, ' -> ', url2])
        print(content)

if __name__ == "__main__":
    main()
Exercise 2 code:
import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup
def test_url(soup):
    # The Chinese phrase below is the literal "entry not yet included" notice
    # on Baidu Baike pages, so it must stay in Chinese for the match to work
    result = soup.find(text=re.compile("百度百科尚未收录词条"))
    if result:
        print(result[0:-1])
        return False
    else:
        return True

def summary(soup):
    word = soup.h1.text
    if soup.h2:
        word += soup.h2.text
    print(word)
    if soup.find(class_="lemma-summary"):
        print(soup.find(class_="lemma-summary").text)

def get_urls(soup):
    for each in soup.find_all(href=re.compile('item')):
        content = "".join([each.text])
        url2 = "".join(["http://baike.baidu.com", each['href']])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = "".join([content, soup2.h2.text])
        content = "".join([content, ' -> ', url2])
        yield content

def main():
    keyword = input("Enter a keyword: ")
    string_parameters = {'word': keyword}
    keyword = urllib.parse.urlencode(string_parameters)
    url = "http://baike.baidu.com/search/word?%s" % keyword
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    if test_url(soup):
        summary(soup)
        command = input("Print related links? (Y / N): ")
        if command == 'N':
            print("Done!")
        elif command == 'Y':
            print("Related links:")
            each = get_urls(soup)
            while True:
                try:
                    for i in range(10):
                        print(next(each))
                except StopIteration:
                    break
                command = input("Enter any key to print more, 'a' to print all remaining (Ctrl+C to quit), or 'q' to quit: ")
                if command == 'q':
                    break
                elif command == 'a':
                    try:
                        for i in each:
                            print(i)
                    except KeyboardInterrupt:
                        print("Exiting!")
                    break
                else:
                    continue

if __name__ == "__main__":
    main()