Lesson 55: On a Crawler's Self-Cultivation (3): Hiding

494 views · 2019-1-21 13:37

About the URL:
Open the page, run a search, and watch the Network panel in DevTools. The topmost entry under Name corresponds to:
General
  1. Request URL:
    https://baike.baidu.com/search/word?word=%E7%8C%AA%E5%85%AB%E6%88%92
  2. Request Method:
    GET
  3. Status Code:
    302 Found
  4. Remote Address:
    61.135.185.24:443
  5. Referrer Policy:
    unsafe-url
Response Headers
Request Headers
Query String Parameters (1)
  word: 猪八戒
and so on. For the hands-on exercise, the redirect URL (and its form) for 猪八戒 is exactly what General -> Request URL shows; a short sketch of reproducing it is given below.
See replies #1697 and #1688 in the course thread.
Question 1: reply #1700; Question 2: reply #1707.
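
Before the full program, here is a minimal sketch (assuming only the endpoint and the word parameter captured above; the redirect target is decided by Baidu's server) that builds the search URL for a keyword and follows the 302 redirect:

[code]
# Minimal sketch: build the baike search URL and follow its 302 redirect.
# Assumes the endpoint and the 'word' query parameter captured in DevTools above.
import urllib.parse
import urllib.request

keyword = '猪八戒'
query = urllib.parse.urlencode({'word': keyword})   # word=%E7%8C%AA%E5%85%AB%E6%88%92
url = 'https://baike.baidu.com/search/word?%s' % query
print(url)                     # same form as General -> Request URL

# urlopen follows the 302 Found automatically; geturl() is where it lands.
# Baidu may reject the default Python User-Agent, in which case the hiding
# techniques in the full program below are needed.
response = urllib.request.urlopen(url)
print(response.geturl())
[/code]

The full solution for the two exercise questions: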
[code]
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re
import random

def main():

    # Ask for the search keyword and URL-encode it as the 'word' parameter
    search = input('Enter a keyword: ')
    search = urllib.parse.urlencode({'word':search})
    url = 'https://baike.baidu.com/search/word?%s' %search


    # Set up a pool of proxy addresses (free proxies from 2019; they may well be offline)
    ip1 = '117.191.11.111:8080'
    ip2 = '222.223.115.30:41303'
    ip3 = '121.61.0.86:9999'
    ip4 = '101.251.216.103:8080'

    iplist=[ip1,ip2,ip3,ip4]

    # Pick one proxy at random. Note: the target URL is https, so an 'http'-only
    # mapping is not applied to it; add an 'https' key too if the proxy supports CONNECT.
    proxy_support = urllib.request.ProxyHandler({'http':random.choice(iplist)})
    opener = urllib.request.build_opener(proxy_support)
    # addheaders is the attribute the opener reads for its default headers
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36')]
    urllib.request.install_opener(opener)

    # Fetch the search result page
    response = urllib.request.urlopen(url)
    html = response.read().decode('utf-8')
    soup = BeautifulSoup(html,'html.parser')

    # Visit every link whose href contains 'item' (the entry pages)
    for each in soup.find_all(href = re.compile('item')):
        url2 = ''.join(['https://baike.baidu.com',each['href']])

        # Links such as 秒懂星课堂 contain raw Chinese characters; percent-encode them
        result = re.search(u'[\u4e00-\u9fa5]+', url2)
        if result:
            context = result.group()
            url2 = ''.join([url2[0:result.start()], urllib.parse.quote(context)])
            
        response2 = urllib.request.urlopen(url2)
        html2 = response2.read().decode('utf-8')
        soup2 = BeautifulSoup(html2,'html.parser')

        # Print the entry title, its sub-heading (if any), and the resolved URL
        if soup2.h2:
            print(each.text, soup2.h2.text, '->', url2)
        else:
            print(each.text, '->', url2)
            
if __name__=='__main__':
    main()
[/code]
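
A side note on the "hiding" part: instead of installing a global opener, the same User-Agent spoofing can be done per request with urllib.request.Request. A minimal sketch (the item URL and the User-Agent string are only example values):

[code]
# Minimal sketch: make one request look like it comes from a browser by
# attaching a User-Agent header directly to the Request object.
import urllib.request

url = 'https://baike.baidu.com/item/%E7%8C%AA%E5%85%AB%E6%88%92'  # example item page
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/71.0.3578.98 Safari/537.36'
}

req = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(req) as response:
    html = response.read().decode('utf-8')
print(html[:200])  # print the first 200 characters to confirm the fetch worked
[/code]

Either way the idea is the same: make the request look as if it comes from an ordinary browser, whether through headers or through a proxy.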

