Error when scraping Bilibili danmaku with a regex
import urllib.request
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}

def open_url(url):
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
    html = response.read().decode('utf-8')
    return html

def danmu():
    url = 'https://www.bilibili.com/video/BV1Fs411A7HZ'
    html = open_url(url)
    danmus = re.findall('<span class=.+title="(.+)">', html)
    print(danmus)

danmu()
Error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
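A likely cause, offered as a guess and not confirmed anywhere in this thread: 0x1f 0x8b is the gzip magic number, so a 0x8b byte at position 1 suggests the server sent a gzip-compressed body, which urllib does not decompress automatically. A minimal sketch that stays with urllib, assuming that is what happened (decode_body is a hypothetical helper name):

```python
import gzip
import urllib.request

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw, encoding="utf-8"):
    """Decompress the body if it looks like gzip, then decode it to str."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


def open_url(url, headers):
    """Fetch a page with urllib and return its text (assumes a UTF-8 page)."""
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as response:
        return decode_body(response.read())
```

Checking the magic bytes instead of the Content-Encoding header keeps the helper usable on any byte string, at the cost of ignoring other compression schemes.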
Please help.

This post was last edited by xiaosi4081 on 2020-5-25 17:17
Still using urllib? Use requests instead:
import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}

def open_url(url):
    # requests.get fetches the page; .text returns the decoded body
    response = requests.get(url, headers=headers)
    html = response.text
    return html

def danmu():
    url = 'https://www.bilibili.com/video/BV1Fs411A7HZ'
    html = open_url(url)
    danmus = re.findall('<span class=.+title="(.+)">', html)
    for each in danmus:
        print(each)

danmu()

This post was last edited by 悠悠2264 on 2020-5-25 17:24
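Python's built-in urllib is not very convenient. You can use requests instead, which is handy because it fetches the page and decodes it automatically (.text).
To install it, run the following in cmd (-i selects a mirror):
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple/
The modified code is as follows: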
import requests
import re

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400'}

def open_url(url):
    # Fetch the page; this call was missing, so req was undefined
    req = requests.get(url, headers=headers)
    html = req.text
    print(html)
    return html

def danmu():
    url = 'https://www.bilibili.com/video/BV1Fs411A7HZ'
    html = open_url(url)
    danmus = re.findall('<span class=.+title="(.+)">', html)
    print(danmus)

danmu()
With urllib:
    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)
is equivalent to this with requests:
    req = requests.get(url, headers=headers)
Isn't that convenient?

悠悠2264 wrote on 2020-5-25 17:19:
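Python's built-in urllib is not very convenient. You can use requests instead, which is handy because it fetches the page and decodes it automatically (.text)
To install ...
When I print(html), I get: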
<?xml version="1.0" encoding="UTF-8"?><i><chatserver>chat.bilibili.com</chatserver><chatid>165642449</chatid><mission>0</mission><maxlimit>8000</maxlimit><state>0</state><real_name>0</real_name><source>k-v</source><d p="1940.09400,1,25,16777215,1584517362,0,77d09b6a,30081122185510915">èå¸ç»çå缩å ä½ ä»¬è½ç¨å</d><d p="2333.04900,1,25,16777215,1584774769,0,56d68bed,30216077480820743">åˉ以</d><d p="2229.33000,1,25,16777215,1584883676,0,cc14bc36,30273176252448775">è¿ä¸aå¨Pythonéé¢æˉä»é¶å¼å§</d><d p="788.26000,1,25,16777215,1586793187,0,6746313c,31274310200459271">ååå</d><d p="2828.19900,1,25,16777215,1586794560,0,6746313c,31275029914189831">è¿éï¼</d><d p="5557.51000,1,25,16777215,1587046015,0,c417c5f4,31406864605904903">ææ</d><d p="793.29900,1,25,16777215,1587094361,0,e4d148d8,31432212221001735">èˉ·æ¶ä¸æçèç</d><d p="4042.10600,1,25,16777215,1587282234,0,e6c7e08e,31530711291265027">å·ä»åååå</d><d p="2008.81800,1,25,16777215,1587800261,0,ff1e0259,31802306895806467">å¨è°·æ-æμè§å¨ä¸è¿ä¸aæä½çå¿«æ·é®æˉshift+ctrl+c</d><d p="2239.46800,1,25,16777215,1587800709,0,ff1e0259,31802541766344711">åæ3èˉ′å°±çè§åé¢çå¼1å1a</d><d p="3097.18000,1,25,16777215,1588057816,0,4f921cf3,31937339999649799">æ们å-|æ ¡èå¸ä¸è¡ï¼è¿ä¸aèå¸è®2å¾å¤aå®1ææäo</d><d p="5339.60100,1,25,16777215,1588061605,0,4f921cf3,31939326645370883">tm..</d><d p="2168.87200,1,25,16777215,1588559886,0,abcf5241,32200569232818183">èæ</d><d p="2311.02500,1,25,16777215,1588559957,0,abcf5241,32200606474567685">ç»äoæä¼äo</d><d p="5063.10400,1,25,16777215,1588673247,0,944cf531,32260002837692421">CSSè£ é¥°ç</d><d p="1230.55900,1,25,16777215,1588693802,0,92af1d9b,32270779487354885">æ¾æμè§å¨ç®å½å</d><d p="1446.26700,1,25,16777215,1588693961,0,92af1d9b,32270863238692871">éç1æ¥äo</d><d p="2300.84600,1,25,16777215,1588694833,0,92af1d9b,32271320342331397">è¿åo|æ¡æäoo</d><d p="2841.27700,1,25,16777215,1588695247,0,92af1d9b,32271537425350663">highlight</d><d p="2914.79600,1,25,16777215,1588695321,0,92af1d9b,32271575970480135">æ件çé®é¢å§</d><d 
p="4734.77100,1,25,16777215,1588696810,0,92af1d9b,32272356597563399">çå®è¿ä¸aççμå½±å»äo</d><d p="5058.57300,1,25,16777215,1588697010,0,92af1d9b,32272461596721155">çaç¶è§å¾è¿å£°é3åå¢æ¬ä¼</d><d p="5551.41300,1,25,16777215,1588697289,0,92af1d9b,32272607636094979">è¿é′</d><d p="2181.75700,1,25,16777215,1589344442,0,34be8334,32611902389485575">èæ</d><d p="947.20600,1,25,16777215,1590321135,0,9b9c5f31,33123970686910467">ææ¡£å¨åaï¼</d><d p="801.97800,1,25,16777215,1590400268,0,8a5f121,33165459228983301">ååå</d></i>
How can I get the danmaku information out of this?

开心果. wrote on 2020-5-25 19:32:
When I print(html), I get:
chat.bilibili.com1656424490800000k-vèå¸ç ...
Never mind, I figured it out.

开心果. wrote on 2020-5-25 19:45:
Never mind, I figured it out.
OK, then please mark it as the best answer.
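For readers who land here with the same question about the XML output above: each comment appears to sit in a <d> element, with the text as the element body and metadata in the comma-separated p attribute. Below is a minimal sketch of extracting them with the standard library. The names parse_danmaku and fix_mojibake are hypothetical, and reading the first p field as the time offset in seconds is an assumption this thread does not confirm. The garbled characters in the printed output look like UTF-8 bytes decoded as ISO-8859-1, so the sketch parses raw bytes instead of already-decoded text.

```python
import xml.etree.ElementTree as ET


def parse_danmaku(xml_bytes):
    """Return (first p field as float, comment text) for every <d> element."""
    root = ET.fromstring(xml_bytes)  # bytes input lets the parser honour encoding="UTF-8"
    comments = []
    for d in root.iter("d"):
        fields = d.get("p", "").split(",")
        # Assumption: fields[0] is the time offset in seconds within the video
        comments.append((float(fields[0]), d.text or ""))
    return comments


def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as ISO-8859-1, when possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not this kind of mojibake, or bytes were lost
```

With requests, this would mean passing response.content (raw bytes) to parse_danmaku instead of response.text, or setting response.encoding = 'utf-8' before reading response.text.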