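After BeautifulSoup parses the page, the URLs I need are not in the result, so they seem to be hidden. What should I do? Any advice is appreciated. As the code below shows, I want to use findAll('a') to collect all the URLs, but the result comes back empty.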
- import requests
- from bs4 import BeautifulSoup
-
- liebiao = []  # collected video URLs
- headers = {
-     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
-     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
-     'Accept-Language': 'en-US,en;q=0.5',
-     'Accept-Encoding': 'gzip',
-     'DNT': '1',
-     'Connection': 'close'
- }
- page = requests.get("https://search.bilibili.com/all?keyword=%E7%8C%AB%E5%92%AA%E6%97%A5%E5%B8%B8%E7%94%A8%E5%93%81&from_source=webtop_search&spm_id_from=333.1007&search_source=5", headers=headers)
- print(page)
-
- soup_obj = BeautifulSoup(page.content, 'html.parser')
- print(soup_obj)
- for link in soup_obj.findAll('a'):  # every <a> tag
-     if "href" in link.attrs:  # only tags that carry an href attribute
-         a = link.attrs['href']
-         if 'www.bilibili.com/video' in a:
-             # print(a)
-             # store the URL in the list
-             liebiao.append("https:" + a)
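Before changing the parsing, it may help to check whether the video links are in the raw response text at all. A minimal sketch, assuming the links appear as double-quoted href attributes; `find_video_hrefs`, `VIDEO_HREF`, and the sample HTML are made up for illustration and are not from the post:

```python
import re

# Hypothetical pattern: a double-quoted href whose value contains bilibili.com/video.
VIDEO_HREF = re.compile(r'href="([^"]*bilibili\.com/video[^"]*)"')


def find_video_hrefs(html: str) -> list:
    """Return every href value in the raw HTML that points at a video page."""
    return VIDEO_HREF.findall(html)


# Made-up sample standing in for page.text (placeholder video ID, not real data).
sample_html = (
    '<a href="//www.bilibili.com/video/BV_EXAMPLE_1/">cat supplies</a>'
    '<a href="//space.bilibili.com/123">uploader</a>'
)

hrefs = find_video_hrefs(sample_html)
print(len(hrefs), hrefs)
```

Calling it on `page.text` from the code above: an empty list would suggest the server's HTML does not carry the links, so no findAll filter can find them, while a non-empty list would point at the parsing or filtering step instead.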
Last edited by suchocolate on 2022-10-26 00:27
- import re
-
- import requests
- from bs4 import BeautifulSoup
-
-
- def main():
-     result = []
-     headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
-     url = "https://search.bilibili.com/all?keyword=%E7%8C%AB%E5%92%AA%E6%97%A5%E5%B8%B8%E7%94%A8%E5%93%81&from_source=" \
-           "webtop_search&spm_id_from=333.1007&search_source=5"
-     r = requests.get(url, headers=headers)
-     r.encoding = 'utf-8'  # decode the response body as UTF-8
-     soup = BeautifulSoup(r.text, 'html.parser')
-     # keep only <a> tags whose href contains bili.com/video
-     for item in soup.find_all('a', attrs={'href': re.compile(r'bili\.com/video')}):
-         result.append(f"https:{item['href']}")
-     print(result)
-
-
- if __name__ == "__main__":
-     main()
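The loop above prepends "https:" to each href, which fits protocol-relative values such as "//www.bilibili.com/video/...". A hedged alternative, in case some hrefs are already absolute or repeat, is to resolve them with `urllib.parse.urljoin`. `to_absolute`, `BASE`, and the sample values are hypothetical names and data for illustration only:

```python
from urllib.parse import urljoin

# Assumed base: the search page the hrefs were taken from.
BASE = "https://search.bilibili.com/"


def to_absolute(hrefs):
    """Resolve each href against BASE and drop duplicates, keeping first-seen order."""
    seen = {}
    for href in hrefs:
        # urljoin fills in the scheme for "//host/path" and leaves full URLs unchanged.
        seen.setdefault(urljoin(BASE, href), None)
    return list(seen)


# Made-up hrefs with placeholder video IDs.
sample_hrefs = [
    "//www.bilibili.com/video/BV_EXAMPLE_1/",
    "https://www.bilibili.com/video/BV_EXAMPLE_1/",
    "//www.bilibili.com/video/BV_EXAMPLE_2/",
]
print(to_absolute(sample_hrefs))
```

In `main()` this would mean collecting `item['href']` as-is and calling `to_absolute(result)` once after the loop.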