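I've just started learning web scraping, and I want to crawl an internal company web page and archive its content. The page has a navigation bar on the left, and I want to crawl the content behind each item in it, which is nested several levels deep.
I'm still at the very beginning of learning to write crawlers, so I wrote the small snippet below just to see whether I can open the page at all.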
- import requests
- from bs4 import BeautifulSoup
- from requests.auth import HTTPBasicAuth
- headers = {
- 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
- }
- response = requests.get(url, headers=headers, auth=HTTPBasicAuth("username", "password"))  # url, username, and password are hidden here; I can't share them
- soup = BeautifulSoup(response.text,'lxml')
- print(response.text)
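But when I run the code I only get the output below. It looks like the content inside the frameset is not expanded, so I can't get at the links inside it. Which topics do I need to study to solve this? Just a pointer in the right direction would be enough.
Thanks!!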
- <html>
- <head>
- <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
- <meta name="GENERATOR" content="Microsoft FrontPage 3.0">
- <meta name="Microsoft Border" content="none">
- <title>FPPS Home</title>
- </head>
- <frameset framespacing="0" border="false" frameborder="0" rows="128,*">
- <frame name="banner" scrolling="auto" noresize target="contents"
- src="home/&frames_home/home_top.htm" style="border-bottom: 2px none rgb(0,0,255)"
- marginwidth="0" marginheight="0">
- <frameset cols="18*,85%">
- <frame name="contents" target="main" src="home/&frames_home/home_left.htm"
- scrolling="auto" marginwidth="0" marginheight="0" style="border: 2px none rgb(0,0,255)">
- <frame name="contents1" src="home/home_home.htm" scrolling="auto" marginwidth="1"
- marginheight="1" style="border: 0px none; padding-left: 5; padding-top: 0">
- </frameset>
- <noframes>
- <body>
- <p>This page uses frames, but your browser doesn't support them.</p>
- </body>
- </noframes>
- </frameset>
- </html>
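Learn Scrapy.
Or use bs (BeautifulSoup) to pull the second-level page URLs out of the frameset, then open each of them with requests.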
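The second suggestion works because requests only returns the frameset document itself; each `<frame>` points to a separate page through its `src` attribute, and those pages have to be requested one by one. Here is a rough sketch of that idea, not a finished crawler. The function names `extract_frame_urls`, `extract_links`, and `fetch_frames` are made up for illustration, and it uses the built-in `html.parser` instead of `lxml` only to avoid an extra dependency.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_frame_urls(html, base_url):
    """Return (frame name, absolute URL) pairs for every <frame src=...>."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (frame.get("name", ""), urljoin(base_url, frame["src"]))
        for frame in soup.find_all("frame", src=True)
    ]


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> in a page,
    e.g. the left navigation frame."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


def fetch_frames(session, start_url):
    """Download the frameset page, then each frame it refers to.

    `session` is assumed to be a requests.Session that already carries
    the auth and headers the site needs (hypothetical setup, see below).
    """
    response = session.get(start_url)
    response.raise_for_status()
    pages = {}
    for name, frame_url in extract_frame_urls(response.text, response.url):
        frame_response = session.get(frame_url)
        frame_response.raise_for_status()
        pages[name] = frame_response.text
    return pages
```

To use it, you would create a `requests.Session()`, set `session.auth = HTTPBasicAuth("username", "password")`, call `session.headers.update(headers)`, and then call `fetch_frames(session, url)`. Running `extract_links` on the HTML of the `contents` frame, with that frame's URL as the base, should give the navigation links to follow for the next level; repeating that step per level covers the nested pages.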