开心果. posted on 2020-5-25 16:18:01

Error when scraping Bilibili danmaku with a regular expression

import urllib.request
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}


def open_url(url):
    req = urllib.request.Request(url,headers = headers)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def danmu():

    url = 'https://www.bilibili.com/video/BV1Fs411A7HZ'
    html = open_url(url)

    danmus = re.findall('<span class=.+title="(.+)">',html)
    print(danmus)


danmu()

Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

Any help would be appreciated.

xiaosi4081 posted on 2020-5-25 17:14:28

Last edited by xiaosi4081 on 2020-5-25 17:17

Still using urllib? Use requests instead:

import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}


def open_url(url):
    # requests sends the headers and fetches the page in one call
    response = requests.get(url,headers=headers)
    html = response.text
    return html

def danmu():

    url = 'https://www.bilibili.com/video/BV1Fs411A7HZ'
    html = open_url(url)

    danmus = re.findall('<span class=.+title="(.+)">',html)
    for each in danmus:
        print(each)


danmu()

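悠悠2264 posted on 2020-5-25 17:19:26

Last edited by 悠悠2264 on 2020-5-25 17:24

Python's built-in urllib is not very convenient. You can use requests instead, which fetches the page with get and decodes the body for you (.text).

To install it, run this in cmd (-i selects a mirror):
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple/

The modified code: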
import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}


def open_url(url):
    # Fetch the page; .text is the decoded response body
    req = requests.get(url,headers = headers)
    html = req.text
    print(html)
    return html

def danmu():

    url = 'https://www.bilibili.com/video/BV1Fs411A7HZ'
    html = open_url(url)

    danmus = re.findall('<span class=.+title="(.+)">',html)
    print(danmus)

danmu()


With urllib:
    req = urllib.request.Request(url,headers = headers)
    response = urllib.request.urlopen(req)
is equivalent to this with requests:
    req = requests.get(url,headers = headers)
Convenient, isn't it?
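
As for the original error: byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the server most likely sent a gzip-compressed body, and urllib does not decompress it for you. If you want to stay with urllib, here is a minimal sketch, assuming the body is either gzip-compressed or plain UTF-8 (decode_body is a made-up helper name, not part of any library):

```python
import gzip


def decode_body(raw, encoding="utf-8"):
    """Decode a response body, decompressing it first if it looks gzipped."""
    # Gzip streams always start with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(encoding)
```

In open_url you would then write html = decode_body(response.read()) instead of calling .decode('utf-8') on the raw bytes.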

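开心果. posted on 2020-5-25 19:32:35

悠悠2264 posted on 2020-5-25 17:19
Python's built-in urllib is not very convenient. You can use requests instead, which fetches the page with get and decodes the body for you (.text).

To install ...

When I print(html) I get: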
<?xml version="1.0" encoding="UTF-8"?><i><chatserver>chat.bilibili.com</chatserver><chatid>165642449</chatid><mission>0</mission><maxlimit>8000</maxlimit><state>0</state><real_name>0</real_name><source>k-v</source>
<d p="1940.09400,1,25,16777215,1584517362,0,77d09b6a,30081122185510915">老师给的压缩包你们能用吗</d>
<d p="2333.04900,1,25,16777215,1584774769,0,56d68bed,30216077480820743">可以</d>
<d p="2229.33000,1,25,16777215,1584883676,0,cc14bc36,30273176252448775">这个在Python里面是从零开始</d>
<d p="788.26000,1,25,16777215,1586793187,0,6746313c,31274310200459271">哈哈哈</d>
<d p="2828.19900,1,25,16777215,1586794560,0,6746313c,31275029914189831">这里?</d>
<d p="5557.51000,1,25,16777215,1587046015,0,c417c5f4,31406864605904903">战战</d>
<d p="793.29900,1,25,16777215,1587094361,0,e4d148d8,31432212221001735">请收下我的膝盖</d>
<d p="4042.10600,1,25,16777215,1587282234,0,e6c7e08e,31530711291265027">喷他哈哈哈哈</d>
<d p="2008.81800,1,25,16777215,1587800261,0,ff1e0259,31802306895806467">在谷歌浏览器上这个操作的快捷键是shift+ctrl+c</d>
<d p="2239.46800,1,25,16777215,1587800709,0,ff1e0259,31802541766344711">刚想说就看见前面的弹幕了</d>
<d p="3097.18000,1,25,16777215,1588057816,0,4f921cf3,31937339999649799">我们学校老师不行,这个老师讲得太容易懂了</d>
<d p="5339.60100,1,25,16777215,1588061605,0,4f921cf3,31939326645370883">tm..</d>
<d p="2168.87200,1,25,16777215,1588559886,0,abcf5241,32200569232818183">舒服</d>
<d p="2311.02500,1,25,16777215,1588559957,0,abcf5241,32200606474567685">终于搞会了</d>
<d p="5063.10400,1,25,16777215,1588673247,0,944cf531,32260002837692421">CSS装饰的</d>
<d p="1230.55900,1,25,16777215,1588693802,0,92af1d9b,32270779487354885">放浏览器目录呗</d>
<d p="1446.26700,1,25,16777215,1588693961,0,92af1d9b,32270863238692871">重点来了</d>
<d p="2300.84600,1,25,16777215,1588694833,0,92af1d9b,32271320342331397">进度条感人</d>
<d p="2841.27700,1,25,16777215,1588695247,0,92af1d9b,32271537425350663">highlight</d>
<d p="2914.79600,1,25,16777215,1588695321,0,92af1d9b,32271575970480135">插件的问题吧</d>
<d p="4734.77100,1,25,16777215,1588696810,0,92af1d9b,32272356597563399">看完这个看电影去了</d>
<d p="5058.57300,1,25,16777215,1588697010,0,92af1d9b,32272461596721155">突然觉得这声音像卢本伟</d>
<d p="5551.41300,1,25,16777215,1588697289,0,92af1d9b,32272607636094979">远鉴</d>
<d p="2181.75700,1,25,16777215,1589344442,0,34be8334,32611902389485575">舒服</d>
<d p="947.20600,1,25,16777215,1590321135,0,9b9c5f31,33123970686910467">文档在哪?</d>
<d p="801.97800,1,25,16777215,1590400268,0,8a5f121,33165459228983301">哇哈哈</d>
</i>

How do I get the danmaku information out of this?

开心果. posted on 2020-5-25 19:45:32

开心果. posted on 2020-5-25 19:32
When I print(html) I get:
chat.bilibili.com1656424490800000k-v老师&#231 ...

Never mind, I figured it out.
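
The thread does not record the actual solution. For anyone with the same question, here is a minimal sketch of one way to do it, assuming the response is XML in the shape printed above, with each danmaku in a <d p="...">text</d> element. The names DANMU_RE and parse_danmu are made up for illustration, and the meaning of the individual p fields is not stated in this thread, so they are only split on commas and left uninterpreted:

```python
import re
from html import unescape

# One <d p="...">text</d> element per match. [^"]* and the non-greedy .*?
# keep each match inside a single element, unlike a greedy .+ pattern.
DANMU_RE = re.compile(r'<d p="([^"]*)">(.*?)</d>', re.S)


def parse_danmu(xml_text):
    """Return a list of (p_fields, text) tuples, one per <d> element."""
    return [(p.split(","), unescape(text))
            for p, text in DANMU_RE.findall(xml_text)]


sample = (
    '<i><d p="788.26000,1,25,16777215,1586793187,0,6746313c,'
    '31274310200459271">哈哈哈</d>'
    '<d p="5339.60100,1,25,16777215,1588061605,0,4f921cf3,'
    '31939326645370883">tm..</d></i>'
)

for fields, text in parse_danmu(sample):
    print(fields[0], text)
```

If you would rather not use a regex, the standard library's xml.etree.ElementTree can parse the same document and iterate over the d elements.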

悠悠2264 posted on 2020-5-25 20:21:47

开心果. posted on 2020-5-25 19:45
Never mind, I figured it out.

OK, please mark it as the best answer.