[Solved] Scrapy crawler problem

Posted on 2019-2-7 17:38:50

I'm writing a crawler for the detailed data behind the bilibili ranking list. The ranking page itself crawls fine, but the request for each video's detail page comes back as a 301.
Here is part of the run output:
C:\Users\49461\Desktop\bilibili模型\bilibili>scrapy crawl bsp
2019-02-07 17:28:06 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: bili)
2019-02-07 17:28:06 [scrapy.utils.log] INFO: Versions: lxml 4.3.0.0, libxml2 2.9.7, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j  20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.17134-SP0
2019-02-07 17:28:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'bili', 'DOWNLOAD_DELAY': 1, 'REDIRECT_ENABLED': False, 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['bili.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'}
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-07 17:28:06 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-07 17:28:06 [scrapy.core.engine] INFO: Spider opened
2019-02-07 17:28:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-07 17:28:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-07 17:28:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bilibili.com/robots.txt> (referer: None)
2019-02-07 17:28:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bilibili.com/ranking/all/4/0/3> (referer: None)
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【老番茄】史上最狠小学生(第四期)',
'pts': '3525890',
'url': '//www.bilibili.com/video/av42184601/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '迦勒底拜年祭',
'pts': '2034108',
'url': '//www.bilibili.com/video/av42565357/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【老E】无惨危机之过年不F♂A红包',
'pts': '912249',
'url': '//www.bilibili.com/video/av42588201/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【FGO】贺岁COS短剧——回迦过年!祝各位Master新春大吉!',
'pts': '907393',
'url': '//www.bilibili.com/video/av42377926/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '罪恶都市速通55分47秒[个人纪录]',
'pts': '788013',
'url': '//www.bilibili.com/video/av42309638/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【C菌】B站三怂再次被吓飞!【生化危机2: 重制版】长篇实况连载, 更新至P17! (健康护眼版)',
'pts': '725282',
'url': '//www.bilibili.com/video/av41726339/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【痴鸡小队Ⅱ】终结:一波三折终吃鸡 大家春节快乐!大吉大利!',
'pts': '716845',
'url': '//www.bilibili.com/video/av42537023/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '【花少北】明明是我先来的,你为什么要舔他!?',
'pts': '561309',
'url': '//www.bilibili.com/video/av42689568/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '我好像把一个什么不得了的东西放走了!丨憎恶之西#2',
'pts': '524497',
'url': '//www.bilibili.com/video/av42444881/'}
2019-02-07 17:28:08 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.bilibili.com/ranking/all/4/0/3>
{'name': '炉石传说:【天天素材库】 第131期',
'pts': '501305',
'url': '//www.bilibili.com/video/av42250633/'}
2019-02-07 17:28:09 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.bilibili.com/video/av42184601/> (referer: None)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<h1>301 Moved Permanently</h1>
<p>The requested resource has been assigned a new permanent URI.</p>
<hr/>Powered by Tengine</body>
</html>

2019-02-07 17:28:10 [scrapy.core.engine] DEBUG: Crawled (301) <GET http://www.bilibili.com/video/av42661513/> (referer: None)
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<h1>301 Moved Permanently</h1>
<p>The requested resource has been assigned a new permanent URI.</p>
<hr/>Powered by Tengine</body>
</html>


2019-02-07 17:28:19 [scrapy.crawler] INFO: Received SIGINT twice, forcing unclean shutdown
Spider code:
import scrapy
from bs4 import BeautifulSoup
from scrapy.spiders import CrawlSpider
import sys
# make the project folder importable so items.py can be found
sys.path.append(r'C:\Users\49461\Desktop\bilibili模型\bilibili\bili')
from items import Bitem


class MySpider(CrawlSpider):
    name = 'bsp'
    allowed_domains = ['www.bilibili.com']
    start_urls = ['https://www.bilibili.com/ranking/all/4/0/3']

    def parse(self, response):
        print(response.text)
        soup = BeautifulSoup(response.text, features='lxml')
        div_b_page_body = soup.find('div', class_='b-page-body')
        div_rank_list = div_b_page_body.find('div', class_='rank-list-wrap')
        ul = div_rank_list.find('ul')
        li = ul.find_all('li')
        for range_massage in li:
            if range_massage is not None:
                rinfo = Bitem()
                rinfo['name'] = range_massage.find('a', class_='title').get_text()
                # href on the ranking page is protocol-relative, e.g. //www.bilibili.com/video/av42184601/
                rinfo['url'] = range_massage.find('a', class_='title').get('href')
                # prepend a scheme so the detail page can be requested
                next_url = ['http:' + range_massage.find('a', class_='title').get('href')]
                rinfo['pts'] = range_massage.find('div', class_='pts').get_text().split(sep='综合得分')[0]
                yield scrapy.Request(next_url[0], callback=self.prase_page)
            yield rinfo

    def prase_page(self, response):
        print(response.text)
        next_soup = BeautifulSoup(response.text, features='lxml')
        # print(next_soup)
        pass
            
items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class Bitem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()     # video title
    url = scrapy.Field()      # video URL
    pts = scrapy.Field()      # overall score (pts)
    towatch = scrapy.Field()  # total views
    tocoin = scrapy.Field()   # total coins
    share = scrapy.Field()    # number of shares
settings.py:
# -*- coding: utf-8 -*-

# Scrapy settings for bili project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'bili'
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True
SPIDER_MODULES = ['bili.spiders']
HTTPERROR_ALLOWED_CODES = [301]  # let 301 responses reach the spider instead of being filtered out
REDIRECT_ENABLED = False         # disable RedirectMiddleware, so 301s are not followed
NEWSPIDER_MODULE = 'bili.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'
'''HEADER={'Accept-Encoding':'gzip, deflate, br'\
            ,'Accept-Language':'zh-CN'\
            ,'Cache-Control':'no-cache'\
            ,'Connection':'Keep-Alive'\
            ,'Host':'www.bilibili.com'\
            ,'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'}'''


'''COOIKES ={'buvid3':'2B3C3593-D004-4DDB-89B6-2FCE4D2AAC5F149044infoc'\
          ,'rpdid':'kmpwsxqsppdospwolswqw'\
          ,'CURRENT_FNVAL':'16'\
          ,'LIVE_BUVID':'AUTO2415404519474072'\
          ,'_uuid':'96B72D07-0CF6-4DBE-AB17-CDA214A4BE4F77784infoc'\
          ,'stardustvideo':'-1'\
          ,'sid':'90knrq8s'}'''
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'bili (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True
Can anyone tell me what this 301 is about and how to deal with it?

OP | Posted on 2019-2-7 17:41:44
Someone please help, I've been stuck on this all through the Spring Festival holiday.

Posted on 2019-2-7 17:42:47
ROBOTSTXT_OBEY = False

Posted on 2019-2-7 18:20:12

But you are scraping data successfully, aren't you?
(attached screenshot: 捕获.PNG)

OP | Posted on 2019-2-7 19:03:54
_谪仙 posted on 2019-2-7 18:20:
But you are scraping data successfully, aren't you?

The URLs scraped out of that page all hit 301 when I go on to request them.

OP | Posted on 2019-2-7 19:06:23
_谪仙 posted on 2019-2-7 18:20:
But you are scraping data successfully, aren't you?

The first page is fine; it's when I take a URL scraped from that first page and request it that I get the 301.

OP | Posted on 2019-2-7 19:07:08

I've already tried that; ignoring the robots protocol still gives the same result.

Posted on 2019-2-7 19:20:52 | Best answer
bilibili checks request headers very strictly; you have to send a Referer and a User-Agent.
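For example, a minimal sketch (my own, untested against bilibili; the Referer value is an assumption, any bilibili page URL may do) of sending those headers on the detail-page request inside parse:

headers = {
    # same desktop UA the project already uses in settings.py
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                   '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134'),
    # assumed Referer: the ranking page the link was scraped from
    'Referer': 'https://www.bilibili.com/ranking/all/4/0/3',
}
yield scrapy.Request(next_url[0], headers=headers, callback=self.prase_page)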

OP | Posted on 2019-2-8 09:22:49
幽梦三影 posted on 2019-2-7 19:20:
bilibili checks request headers very strictly; you have to send a Referer and a User-Agent.

Let me try that.

OP | Posted on 2019-2-8 09:28:38
幽梦三影 posted on 2019-2-7 19:20:
bilibili checks request headers very strictly; you have to send a Referer and a User-Agent.

How can you tell whether a site checks for things like this?

Posted on 2019-2-8 09:41:12
yang930808 posted on 2019-2-8 09:28:
How can you tell whether a site checks for things like this?

There's only one way: test it.
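For instance, a quick probe along these lines (my own sketch, not from the thread; it uses the requests library and one of the av URLs that returned 301 in the log above) compares the status code with and without browser-like headers:

import requests

url = 'http://www.bilibili.com/video/av42184601/'  # URL that returned 301 in the log
ua = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134')

# bare request vs. request carrying User-Agent and Referer;
# allow_redirects=False keeps the raw status code instead of following the redirect
bare = requests.get(url, allow_redirects=False)
dressed = requests.get(url, allow_redirects=False,
                       headers={'User-Agent': ua,
                                'Referer': 'https://www.bilibili.com/ranking/all/4/0/3'})
print(bare.status_code, dressed.status_code)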

OP | Posted on 2019-2-8 17:27:45
幽梦三影 posted on 2019-2-8 09:41:
There's only one way: test it.

I found the cause: the addresses taken from the ranking page are http, but the video pages automatically redirect to https.
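If that is the cause, two possible fixes come to mind, both untested sketches rather than confirmed solutions: build the detail-page URL with https directly (the href on the ranking page is protocol-relative), or stop blocking redirects so Scrapy follows the 301 itself.

# Option 1: in parse(), prepend https instead of http to the protocol-relative href
next_url = ['https:' + range_massage.find('a', class_='title').get('href')]

# Option 2: in settings.py, remove the lines that surface the 301 and let RedirectMiddleware follow it
# (REDIRECT_ENABLED defaults to True, and HTTPERROR_ALLOWED_CODES is no longer needed then)
# REDIRECT_ENABLED = False
# HTTPERROR_ALLOWED_CODES = [301]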
