Lesson 55: On a Crawler's Self-Cultivation (3): Hiding

494 views · 2019-1-21 13:37

About the URL:
Open the page, run a search, and watch the Network panel in DevTools. The topmost entry under Name corresponds to:
General
  1. Request URL:
    https://baike.baidu.com/search/word?word=%E7%8C%AA%E5%85%AB%E6%88%92
  2. Request Method:
    GET
  3. Status Code:
    302 Found
  4. Remote Address:
    61.135.185.24:443
  5. Referrer Policy:
    unsafe-url
Response Headers
Request Headers
Query String Parameters (1)
  word: 猪八戒
and so on. For the hands-on exercise, the redirect URL (and its form) for 猪八戒 is exactly what General -> Request URL shows; a short sketch of reproducing it is given below.
See replies #1697 and #1688 in the course thread.
Question 1: reply #1700; Question 2: reply #1707.
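
Before the full program, here is a minimal sketch (assuming only the endpoint and the word parameter captured above; the redirect target is decided by Baidu's server) that builds the search URL for a keyword and follows the 302 redirect:

[code]
# Minimal sketch: build the baike search URL and follow its 302 redirect.
# Assumes the endpoint and the 'word' query parameter captured in DevTools above.
import urllib.parse
import urllib.request

keyword = '猪八戒'
query = urllib.parse.urlencode({'word': keyword})   # word=%E7%8C%AA%E5%85%AB%E6%88%92
url = 'https://baike.baidu.com/search/word?%s' % query
print(url)                     # same form as General -> Request URL

# urlopen follows the 302 Found automatically; geturl() is where it lands.
# Baidu may reject the default Python User-Agent, in which case the hiding
# techniques in the full program below are needed.
response = urllib.request.urlopen(url)
print(response.geturl())
[/code]

The full solution for the two exercise questions: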
[code]
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re
import random

def main():

    # Ask for the search keyword and URL-encode it as the 'word' parameter
    search = input('Enter a keyword: ')
    search = urllib.parse.urlencode({'word':search})
    url = 'https://baike.baidu.com/search/word?%s' %search


    # Set up a pool of proxy addresses (free proxies from 2019; they may well be offline)
    ip1 = '117.191.11.111:8080'
    ip2 = '222.223.115.30:41303'
    ip3 = '121.61.0.86:9999'
    ip4 = '101.251.216.103:8080'

    iplist=[ip1,ip2,ip3,ip4]

    # Pick one proxy at random. Note: the target URL is https, so an 'http'-only
    # mapping is not applied to it; add an 'https' key too if the proxy supports CONNECT.
    proxy_support = urllib.request.ProxyHandler({'http':random.choice(iplist)})
    opener = urllib.request.build_opener(proxy_support)
    # addheaders is the attribute the opener reads for its default headers
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')]
    urllib.request.install_opener(opener)

    # Fetch the search result page
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html,'html.parser')

    # Visit every link whose href contains 'item' (the entry pages)
    for each in soup.find_all(href = re.compile('item')):
        url2 = ''.join(['https://baike.baidu.com',each['href']])

        # Links such as 秒懂星课堂 contain raw Chinese characters; percent-encode them
        result = re.search(u'[\u4e00-\u9fa5]+', url2)
        if result:
            context = result.group()
            url2 = ''.join([url2[0:result.start()], urllib.parse.quote(context)])
            
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read().decode('utf-8')
        soup2 = BeautifulSoup(html2,'html.parser')

        # Print the entry title, its sub-heading (if any), and the resolved URL
        if soup2.h2:
            print(each.text, soup2.h2.text, '->', url2)
        else:
            print(each.text, '->', url2)
            
if __name__=='__main__':
    main()
[/code]
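
A side note on the "hiding" part: instead of installing a global opener, the same User-Agent spoofing can be done per request with urllib.request.Request. A minimal sketch (the item URL and the User-Agent string are only example values):

[code]
# Minimal sketch: make one request look like it comes from a browser by
# attaching a User-Agent header directly to the Request object.
import urllib.request

url = 'https://baike.baidu.com/item/%E7%8C%AA%E5%85%AB%E6%88%92'  # example item page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36'
}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    html = response.read().decode('utf-8')
print(html[:200])  # print the first 200 characters to confirm the fetch worked
[/code]

Either way the idea is the same: make the request look as if it comes from an ordinary browser, whether through headers or through a proxy.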

