Last edited by 风不会停息 on 2018-9-2 21:21
1. A server distinguishes browsers from non-browser clients by the User-Agent field in the HTTP request headers; it also uses User-Agent to tell the various browsers apart.
2. When writing a crawler in Python, you can set the User-Agent where needed to masquerade as a browser. There are two ways:
1. urllib.request.Request(url, data, headers): pass the headers argument to set the User-Agent. headers is a dictionary and can be set like:
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
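A minimal offline sketch of this first method (example.com is just a placeholder URL; no request is actually sent here):

```python
import urllib.request

# Hypothetical target URL for illustration only
url = 'http://example.com'

head = {}
head['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'

# Pass the headers dictionary via the headers keyword argument
req = urllib.request.Request(url, headers=head)

# Request stores header names in capitalized form, e.g. 'User-agent'
print(req.get_header('User-agent'))
```

To actually fetch the page you would then pass req to urllib.request.urlopen().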
2. Use the add_header() method to add headers to an existing Request object, for example:
req = urllib.request.Request(url, data)
req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')
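The second method as a small self-contained sketch (again using a placeholder URL, with nothing actually fetched):

```python
import urllib.request

# Hypothetical URL; nothing is downloaded in this sketch
req = urllib.request.Request('http://example.com')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')

# Header names are stored capitalized internally
print(req.has_header('User-agent'))
# To send the request: response = urllib.request.urlopen(req)
```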
3. Proxies: using a proxy to access a server in Python takes the following three steps:
1. Create a proxy handler, ProxyHandler:
proxy_support = urllib.request.ProxyHandler(). ProxyHandler is a class whose argument is a dictionary: {'scheme': 'proxy-ip:port'}.
What is a handler? Each handler knows how to open URLs via a particular protocol, or how to handle some aspect of opening a URL, such as HTTP redirects or HTTP cookies.
2. Build a custom opener:
opener = urllib.request.build_opener(proxy_support)
What is an opener? Whenever Python opens a URL, it goes through an opener. In fact, urllib.request.urlopen() uses a default opener; here we build a custom opener so we can attach our own handler.
3a. Install the opener:
urllib.request.install_opener(opener)
install_opener sets the (global) default opener, which means subsequent calls to urlopen will use the opener you installed.
3b. Or call the opener directly:
opener.open(url)
This method fetches URLs just like urlopen does; normally there is no need to call install_opener, except for convenience.
import urllib.request
import random

url = "https://www.ip.cn/"
iplist = ['121.43.170.207:3128']

# Note: the target URL is https, so strictly the handler dict would also need an 'https' entry
proxy_support = urllib.request.ProxyHandler({'http': random.choice(iplist)})
opener = urllib.request.build_opener(proxy_support)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36')]  # masquerade as a browser
urllib.request.install_opener(opener)
response = urllib.request.urlopen(url)
html = response.read().decode('utf-8')
print(html)
4. Beautiful Soup 4 module documentation (Chinese): https://www.crummy.com/software/ ... c.zh/index.html#id9
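A tiny self-contained Beautiful Soup example, using a made-up HTML fragment in place of a downloaded page, to show the tag access and href filtering used in the exercise code below:

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a fetched page
html = ('<html><body><h1>Python</h1>'
        '<a href="/item/BeautifulSoup">BeautifulSoup</a>'
        '<a href="/view/other">other</a></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.h1.text)  # direct tag access: prints Python
# find_all with a compiled regex keeps only links whose href contains 'item'
for a in soup.find_all(href=re.compile('item')):
    print(a['href'])  # prints /item/BeautifulSoup
```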
5. Regular expressions, introduction and simple applications:
1. Introduction: http://fishc.com.cn/forum.php?mo ... peid%26typeid%3D403
2. Simple applications: http://fishc.com.cn/forum.php?mo ... peid%26typeid%3D403
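Two quick regex examples covering the patterns used in this post (the sample strings are made up for illustration):

```python
import re

# re.compile('item') matches any string that contains 'item',
# which is how the exercise code filters Baike links
pattern = re.compile('item')
print(bool(pattern.search('/item/Python')))  # True
print(bool(pattern.search('/view/Python')))  # False

# re.search finds the first match; group() returns the matched text
m = re.search(r'\d+', 'posted on 2018-9-2')
print(m.group())  # 2018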
Exercise 1 code:
import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup

def main():
    keyword = input("Enter a keyword: ")
    string_parameters = {'word': keyword}
    keyword = urllib.parse.urlencode(string_parameters)

    url = "http://baike.baidu.com/search/word?%s" % keyword
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")

    # collect every link whose href contains 'item'
    for each in soup.find_all(href=re.compile('item')):
        content = "".join([each.text])
        url2 = "".join(["http://baike.baidu.com", each['href']])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = "".join([content, soup2.h2.text])
        content = "".join([content, ' -> ', url2])
        print(content)

if __name__ == "__main__":
    main()
Exercise 2 code:
import urllib.request
import urllib.parse
import re
from bs4 import BeautifulSoup

def test_url(soup):
    # "百度百科尚未收录词条" = "Baidu Baike has no entry for this term";
    # the Chinese marker must stay as-is to match the page text
    result = soup.find(text=re.compile("百度百科尚未收录词条"))
    if result:
        print(result[0:-1])
        return False
    else:
        return True

def summary(soup):
    word = soup.h1.text
    if soup.h2:
        word += soup.h2.text
    print(word)
    if soup.find(class_="lemma-summary"):
        print(soup.find(class_="lemma-summary").text)

def get_urls(soup):
    for each in soup.find_all(href=re.compile('item')):
        content = "".join([each.text])
        url2 = "".join(["http://baike.baidu.com", each['href']])
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read()
        soup2 = BeautifulSoup(html2, "html.parser")
        if soup2.h2:
            content = "".join([content, soup2.h2.text])
        content = "".join([content, ' -> ', url2])
        yield content

def main():
    keyword = input("Enter a keyword: ")
    string_parameters = {'word': keyword}
    keyword = urllib.parse.urlencode(string_parameters)
    url = "http://baike.baidu.com/search/word?%s" % keyword
    response = urllib.request.urlopen(url)
    html = response.read()
    soup = BeautifulSoup(html, "html.parser")
    if test_url(soup):
        summary(soup)
        command = input("Print related links? (Y / N): ")
        if command == 'N':
            print("Done!")
        elif command == 'Y':
            print("Related links:")
            each = get_urls(soup)
            while True:
                try:
                    # print ten links per batch
                    for i in range(10):
                        print(next(each))
                except StopIteration:
                    break
                command = input("Enter any character to keep printing, 'a' to print all remaining (Ctrl+C to quit), 'q' to quit: ")
                if command == 'q':
                    break
                elif command == 'a':
                    try:
                        for i in each:
                            print(i)
                    except KeyboardInterrupt:
                        print("Exiting!")
                    break
                else:
                    continue

if __name__ == "__main__":
    main()