Bilibili scraper results don't match the page?, Python Discussion, Programming Languages, 鱼C论坛

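麻麦皮 posted on 2020-4-24 10:20:56

Bilibili scraper results don't match the page?

When I scrape with the "most played" sort order, the results are stable and match the page exactly.

With the "comprehensive" sort order, things change: the first few results match what I scraped, but everything after that is completely different from the page. I thought the comprehensive-sort page itself had changed, but after refreshing, the page still showed the same content as before.

Can anyone explain what is going on?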
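To pin down exactly where the scraped list and the page start to disagree, a small comparison helper can be useful. This is only a sketch: the function name `first_mismatch` and the sample titles are made up for illustration, and it assumes you have already copied both title lists into Python lists in display order.

```python
def first_mismatch(scraped, on_page):
    """Return the index where the two title lists first differ, or None if identical."""
    for i, (a, b) in enumerate(zip(scraped, on_page)):
        if a != b:
            return i
    # All shared positions match; the lists differ only if one is longer.
    if len(scraped) != len(on_page):
        return min(len(scraped), len(on_page))
    return None


# Hypothetical sample data, not real search results.
scraped_titles = ["title A", "title B", "title X"]
page_titles = ["title A", "title B", "title C"]

idx = first_mismatch(scraped_titles, page_titles)
# Titles that were scraped but never appear on the page at all.
only_scraped = [t for t in scraped_titles if t not in set(page_titles)]
print(idx, only_scraped)
```

If `only_scraped` is empty, the two lists hold the same videos in a different order; if it is not, the server returned different content to the scraper.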

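老八秘制 posted on 2020-4-24 10:49:54

Bilibili is not completely free of anti-scraping measures, and what you see in Inspect Element is not necessarily what your scraper downloads.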
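A rough way to check this: take the raw HTML your scraper received and test whether the titles you can see in the browser actually occur in it. If some are missing, that part of the page is probably filled in by JavaScript after loading, or the server sent the scraper different content. The helper name `titles_missing_from_html` and the sample strings below are hypothetical.

```python
import html


def titles_missing_from_html(raw_html, visible_titles):
    """Return the browser-visible titles that do not occur in the raw HTML text."""
    # Unescape entities such as &amp; so titles compare against plain text.
    text = html.unescape(raw_html)
    return [t for t in visible_titles if t not in text]


# Hypothetical sample: one title is in the downloaded HTML, one is not.
sample_html = '<a title="Python &amp; bs4 basics">Python &amp; bs4 basics</a>'
seen_in_browser = ["Python & bs4 basics", "C++ in one hour"]
print(titles_missing_from_html(sample_html, seen_in_browser))
```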

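suchocolate posted on 2020-4-24 11:09:22

Make your headers mimic the headers a real browser sends as closely as possible. If you don't know how: open the browser's developer tools (F12), switch to the Network tab, then load the site you want to scrape and look at the request headers.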
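As a minimal sketch of that advice, the headers copied from the Network tab can be kept in one dictionary and attached to every request. The header values below are examples only (copy the real ones from your own browser), and `BROWSER_HEADERS` and `build_request` are names invented for this illustration. The standard library's `urllib.request` is used here so the snippet has no dependencies.

```python
import urllib.request

# Example values only; replace them with what your browser actually sends.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) "
                  "Gecko/20100101 Firefox/66.0",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Referer": "https://www.bilibili.com/",
}


def build_request(url, headers=BROWSER_HEADERS):
    """Create (but do not send) a request that carries browser-like headers."""
    return urllib.request.Request(url, headers=dict(headers))


req = build_request("https://search.bilibili.com/all?keyword=%E7%BC%96%E7%A8%8B")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

The same dictionary can also be passed to `requests.get(url, headers=BROWSER_HEADERS)` if you are using requests.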

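会计的会怎么念 posted on 2020-4-24 11:52:26

Last edited by 会计的会怎么念 on 2020-4-24 11:54

- Bilibili doesn't really have any anti-scraping measures.
- Most likely your HTML parsing is the problem.
- I won't rebut the two replies above in detail; this is a thread I just posted that covers exactly these issues.
- Below is the code I wrote to scrape the Bilibili search page while I was learning bs4. It is only a few lines, and it retrieves the information correctly.
- In practice I haven't used bs4 at all since learning it; I do everything with lxml combined with XPath.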
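import requests
from bs4 import BeautifulSoup


def get_html(url):
    """Fetch one search page; return its HTML, or None if the request failed."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0'
    }
    response = requests.get(url=url, headers=headers, timeout=10)
    if response.status_code != requests.codes.ok:
        print("Request failed")
        return None
    print("OK")
    return response.text


def extract_title(html):
    """Append every video title found in the page to bili_title.txt."""
    soup = BeautifulSoup(html, 'lxml')
    # Write as UTF-8 so titles with non-ASCII characters are not lost
    with open('bili_title.txt', 'a', encoding='utf-8') as file:
        for each in soup.find_all(class_='info'):
            headline = each.find(class_='headline clearfix')
            link = headline.find(name='a') if headline else None
            if link is None or not link.get('title'):
                continue  # skip blocks that do not look like a video entry
            title = link['title']
            file.write(title + '\n')
            print(title)


if __name__ == '__main__':
    # order=click sorts by play count; pages 1 to 50 of the search results
    for i in range(1, 51):
        url = "https://search.bilibili.com/all?keyword=%E7%BC%96%E7%A8%8B&from_source=banner_search&order=click&duration=0&tids_1=0&page=" + str(i)
        html = get_html(url)
        if html is not None:
            extract_title(html)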

- Good luck!
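The reply mentions preferring lxml with XPath but does not show it. A rough sketch of the same title extraction in that style might look like the following. The class names `info` and `headline clearfix` come from the bs4 code in the reply; the function name `extract_titles_xpath` and the sample HTML are made up, and the real search page markup may differ or may have changed. Unlike the bs4 version, this returns every matching link title rather than only the first one per block.

```python
from lxml import etree


def extract_titles_xpath(html):
    """Return the title attribute of each video link, using lxml and XPath."""
    tree = etree.HTML(html)
    return tree.xpath(
        # Any element whose class list contains the token "info" ...
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' info ')]"
        # ... then the headline block inside it, then the link's title.
        "//*[@class='headline clearfix']//a/@title"
    )


# Hypothetical markup shaped like what the bs4 code expects.
sample = (
    '<ul>'
    '<li><div class="info"><div class="headline clearfix">'
    '<a title="Learn Python" href="#">Learn Python</a></div></div></li>'
    '<li><div class="info"><div class="headline clearfix">'
    '<a title="编程入门" href="#">编程入门</a></div></div></li>'
    '</ul>'
)
print(extract_titles_xpath(sample))
```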
