I'm writing a crawler for the detailed data behind the bilibili ranking list. The ranking page itself scrapes fine, but every request for an individual video's detail page comes back as a 301.
Part of the run output is below:
C:\Users\49461\Desktop\bilibili模型\bilibili>scrapy crawl bsp
2019-02-07 17:28:06 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: bili)
2019-02-07 17:28:06 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
2019-02-07 17:28:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'bili', 'DOWNLOAD_DELAY': 1, 'REDIRECT_ENABLED': False, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['bili.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'}
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-07 17:28:06 [scrapy.core.engine] INFO: Spider opened
2019-02-07 17:28:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-07 17:28:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-07 17:28:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bilibili.com/robots.txt> (referer: None)
2019-02-07 17:28:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bilibili.com/ranking/all/4/0/3> (referer: None)
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【老番茄】史上最狠小学生(第四期)',
'pts': '3525890',
'url': '//www.bilibili.com/video/av42184601/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '迦勒底拜年祭',
'pts': '2034108',
'url': '//www.bilibili.com/video/av42565357/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【老E】无惨危机之过年不F♂A红包',
'pts': '912249',
'url': '//www.bilibili.com/video/av42588201/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【FGO】贺岁COS短剧——回迦过年!祝各位Master新春大吉!',
'pts': '907393',
'url': '//www.bilibili.com/video/av42377926/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '罪恶都市速通55分47秒[个人纪录]',
'pts': '788013',
'url': '//www.bilibili.com/video/av42309638/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【C菌】B站三怂再次被吓飞!【生化危机2: 重制版】长篇实况连载, 更新至P17! (健康护眼版)',
'pts': '725282',
'url': '//www.bilibili.com/video/av41726339/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【痴鸡小队Ⅱ】终结:一波三折终吃鸡 大家春节快乐!大吉大利!',
'pts': '716845',
'url': '//www.bilibili.com/video/av42537023/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【花少北】明明是我先来的,你为什么要舔他!?',
'pts': '561309',
'url': '//www.bilibili.com/video/av42689568/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '我好像把一个什么不得了的东西放走了!丨憎恶之西#2',
'pts': '524497',
'url': '//www.bilibili.com/video/av42444881/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '炉石传说:【天天素材库】 第131期',
'pts': '501305',
'url': '//www.bilibili.com/video/av42250633/'}
2019-02-07 17:28:09 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.bilibili.com/video/av42184601/> (referer: None)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<h1>301 Moved Permanently</h1>
<p>The requested resource has been assigned a new permanent URI.</p>
<hr/>Powered by Tengine</body>
</html>
2019-02-07 17:28:10 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.bilibili.com/video/av42661513/> (referer: None)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<h1>301 Moved Permanently</h1>
<p>The requested resource has been assigned a new permanent URI.</p>
<hr/>Powered by Tengine</body>
</html>
2019-02-07 17:28:19 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
Spider code:
import scrapy
from bs4 import BeautifulSoup
from scrapy.spiders import CrawlSpider
import sys
sys.path.append(r'C:\Users\49461\Desktop\bilibili模型\bilibili\bili')
from items import Bitem


class MySpider(CrawlSpider):
    name = 'bsp'
    allowed_domains = ['www.bilibili.com']
    start_urls = ['https://www.bilibili.com/ranking/all/4/0/3']

    def parse(self, response):
        print(response.text)
        soup = BeautifulSoup(response.text, features='lxml')
        div_b_page_body = soup.find('div', class_='b-page-body')
        div_rank_list = div_b_page_body.find('div', class_='rank-list-wrap')
        ul = div_rank_list.find('ul')
        li = ul.find_all('li')
        for range_massage in li:
            if range_massage is not None:
                rinfo = Bitem()
                rinfo['name'] = range_massage.find('a', class_='title').get_text()
                rinfo['url'] = range_massage.find('a', class_='title').get('href')
                next_url = ['http:' + range_massage.find('a', class_='title').get('href')]
                rinfo['pts'] = range_massage.find('div', class_='pts').get_text().split(sep='综合得分')[0]
                yield scrapy.Request(next_url[0], callback=self.parse_page)
                yield rinfo

    def parse_page(self, response):
        print(response.text)
        next_soup = BeautifulSoup(response.text, features='lxml')
        # print(next_soup)
        pass
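One thing worth noticing in the log: every 301 is on an http:// URL. The ranking page's hrefs are protocol-relative (//www.bilibili.com/video/...), the spider prefixes them with 'http:', and bilibili permanently redirects http to https; with REDIRECT_ENABLED = False the spider stops at the 301 page instead of following it. A minimal sketch of building the https URL directly (the helper name `absolutize` is just for illustration):

```python
def absolutize(href, scheme="https"):
    """Turn a protocol-relative href like //www.bilibili.com/video/av42184601/
    into an absolute URL. Requesting the https form avoids the
    http -> https 301 that appears in the log above."""
    if href.startswith("//"):
        return scheme + ":" + href
    return href

# In parse() this would replace the 'http:' + ... concatenation:
# next_url = absolutize(range_massage.find('a', class_='title').get('href'))
```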
Item code:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class Bitem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()     # video title
    url = scrapy.Field()      # video URL
    pts = scrapy.Field()      # overall ranking score
    towatch = scrapy.Field()  # total view count
    tocoin = scrapy.Field()   # total coins
    share = scrapy.Field()    # share count
Settings:
# -*- coding: utf-8 -*-
# Scrapy settings for bili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'bili'
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
SPIDER_MODULES = ['bili.spiders']
HTTPERROR_ALLOWED_CODES = [301]
REDIRECT_ENABLED = False
NEWSPIDER_MODULE = 'bili.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
'''HEADER = {'Accept-Encoding': 'gzip, deflate, br',
             'Accept-Language': 'zh-CN',
             'Cache-Control': 'no-cache',
             'Connection': 'Keep-Alive',
             'Host': 'www.bilibili.com',
             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'}'''
'''COOKIES = {'buvid3': '2B3C3593-D004-4DDB-89B6-2FCE4D2AAC5F149044infoc',
              'rpdid': 'kmpwsxqsppdospwolswqw',
              'CURRENT_FNVAL': '16',
              'LIVE_BUVID': 'AUTO2415404519474072',
              '_uuid': '96B72D07-0CF6-4DBE-AB17-CDA214A4BE4F77784infoc',
              'stardustvideo': '-1',
              'sid': '90knrq8s'}'''
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bili (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
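Two settings-level options often used against this kind of 301 (a sketch against the settings above, not a guaranteed fix):

```python
# Option 1: let Scrapy follow the permanent redirect instead of stopping at it.
REDIRECT_ENABLED = True

# Option 2: keep redirects disabled but send sensible default headers on every
# request; the Referer value here is an assumption (the site root), and
# RefererMiddleware will overwrite it with the linking page once crawling starts.
DEFAULT_REQUEST_HEADERS = {
    'Accept-Language': 'zh-CN',
    'Referer': 'https://www.bilibili.com/',
}
```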
Can anyone explain what this 301 is about and how to fix it?
bilibili is strict about request headers; you must include Referer and User-Agent.
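Following that suggestion, a sketch of sending those headers explicitly on each detail-page request. The Referer value below is an assumption: the ranking page that links to the videos is the natural choice, but any on-site page that links to the video should work.

```python
# Headers to pass per request; the User-Agent matches the one already in
# settings.py, and the Referer is illustrative.
detail_headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'),
    'Referer': 'https://www.bilibili.com/ranking/all/4/0/3',
}

# In the spider's parse() this would become:
# yield scrapy.Request(next_url[0], headers=detail_headers, callback=...)
```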