Posted on 2020-4-24 11:52:26
Last edited by 会计的会怎么念 on 2020-4-24 11:54
- Bilibili doesn't really have any anti-crawler measures
- Most likely the problem is in your HTML parsing
- I won't rebut the two posters above point by point; I just made a post covering exactly these issues
- This is code I wrote while learning bs4 to crawl Bilibili's search pages. It's only a few lines, and it extracts the information correctly
- In practice I haven't used bs4 at all since learning it; I do everything with lxml plus XPath
import requests
from bs4 import BeautifulSoup


def get_html(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0'
    }
    response = requests.get(url=url, headers=headers)
    if response.status_code != requests.codes.ok:
        print("Request Failed")
    else:
        print("OK")
    # Return the body either way so the caller can inspect it
    return response.text


def extract_title(html):
    soup = BeautifulSoup(html, 'lxml')
    with open('bili_title.txt', 'a') as file:
        for each in soup.find_all(class_='info'):
            title = each.find(class_='headline clearfix').find(name='a')['title']
            try:
                file.write(title + '\n')
            except UnicodeEncodeError:
                # Skip titles the local file encoding can't represent
                continue
            print(title)


if __name__ == '__main__':
    for i in range(1, 51):
        url = "https://search.bilibili.com/all?keyword=%E7%BC%96%E7%A8%8B&from_source=banner_search&order=click&duration=0&tids_1=0&page=" + str(i)
        html = get_html(url)
        extract_title(html)
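For comparison, the lxml-plus-XPath approach mentioned above can be sketched like this. This is a minimal sketch run on a static HTML snippet whose class names mirror the bilibili markup targeted by the bs4 code; the live page structure may differ, and the snippet itself is made up for illustration:

```python
from lxml import html

# Hypothetical snippet imitating bilibili's search-result markup
snippet = """
<div class="info">
  <div class="headline clearfix"><a title="Python tutorial">link</a></div>
</div>
<div class="info">
  <div class="headline clearfix"><a title="C crash course">link</a></div>
</div>
"""

tree = html.fromstring(snippet)
# The same extraction as the BeautifulSoup version, as a single XPath:
# every <a> title attribute under a headline div inside an info div
titles = tree.xpath('//div[@class="info"]//div[@class="headline clearfix"]/a/@title')
print(titles)  # ['Python tutorial', 'C crash course']
```

One XPath expression replaces the nested find_all/find calls, which is why lxml tends to feel more compact once the page structure is known.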
- Keep it up!