鱼C论坛


Help with a "list index out of range" error

Posted on 2018-6-19 17:52:55

I'm scraping news information from Sina news pages. The full code is below, and running it produces the error shown after it. Could someone tell me where the problem is and how to fix it? I'm a beginner; any help is appreciated.
import requests
from bs4 import BeautifulSoup
from datetime import datetime
import re
import json
import pandas
import sqlite3

commenturl = 'http://comment5.news.sina.com.cn/page/info?\version=1&format=js&channel=gn&newsid=comos-{}&\group=&compress=0&ie=utf-8&oe=utf-8&page=1&\page_size=20'

#Get the comment count for an article
def getCommentCounts(newsurl):
    m = re.search('doc-i(.*).shtml', newsurl)
    newsid = m.group(1)
    comments = requests.get(commenturl.format(newsid))
    jd = json.loads(comments.text.strip('var data='))
    return jd['result']['count']['total']

#Get the news article details
def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('#artibodyTitle')[0].text
    result['newssource'] = soup.select('.time-source span a')[0].text
    timesource = soup.select('.time-source')[0].contents[0].strip()
    result['dt'] = datetime.strptime(timesource,'%Y年%m月%d日%H:%M')
    result['article'] = '@'.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    result['editor'] = soup.select('.article-editor')[0].text.strip('责任编辑:')
    result['comments'] = getCommentCounts(newsurl)
    return result

#Parse the article links from one list page
def parseListLinks(url):
    newsdetails = []
    res = requests.get(url)
    jd = json.loads(res.text.rstrip(');').lstrip(' newsloadercallback('))
    for ent in jd['result']['data']:
        newsdetails.append(getNewsDetail(ent['url']))
    return newsdetails

#url is the paginated list API; the key parameter is page
url = 'http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1&show_all=1&show_num=22&tag=1&format=json&page={}&callback=newsloadercallback&_=1509779364426'
news_total = []

#Crawl news from pages 1 and 2 (note: range(1, 2) actually yields only page 1; range(1, 3) covers both)
for i in range(1,2):
    newsurl = url.format(i)
    newsary = parseListLinks(newsurl)
    news_total.extend(newsary)

#Store the data with sqlite and display it with pandas (note: nothing writes the 'news' table yet; a df.to_sql call would be needed before read_sql_query)
df = pandas.DataFrame(news_total)   
with sqlite3.connect('news.sqlite') as db:
    df2 = pandas.read_sql_query('select * from news', con = db)


The error after running:
IndexError                                Traceback (most recent call last)
<ipython-input-2-037ecb2a6d10> in <module>()
     48 for i in range(1,2):
     49     newsurl = url.format(i)
---> 50     newsary = parseListLinks(newsurl)
     51     news_total.extend(newsary)
     52

<ipython-input-2-037ecb2a6d10> in parseListLinks(url)
     38     jd = json.loads(res.text.rstrip(');').lstrip(' newsloadercallback('))
     39     for ent in jd['result']['data']:
---> 40         newsdetails.append(getNewsDetail(ent['url']))
     41     return newsdetails
     42

<ipython-input-2-037ecb2a6d10> in getNewsDetail(newsurl)
     23     res.encoding = 'utf-8'
     24     soup = BeautifulSoup(res.text, 'html.parser')
---> 25     result['title'] = soup.select('#artibodyTitle')[0].text
     26     result['newssource'] = soup.select('.time-source span a')[0].text
     27     timesource = soup.select('.time-source')[0].contents[0].strip()

IndexError: list index out of range
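The traceback shows the failure is at `soup.select('#artibodyTitle')[0]`: for at least one article URL the selector matched nothing, so indexing the empty list raises IndexError (likely a page served with a different Sina template, or a redirect/special page). Below is a minimal sketch of a guard, assuming that diagnosis; `select_text` and `FakeTag` are illustrative names I made up, not part of any library:

```python
class FakeTag:
    """Stand-in for a BeautifulSoup tag, so this sketch runs without bs4."""
    def __init__(self, text):
        self.text = text

def select_text(matches, default=''):
    """Text of the first match from soup.select(...), or a default if empty."""
    return matches[0].text if matches else default

# A page where the selector matched parses normally:
print(select_text([FakeTag('News title')]))   # -> News title
# A page where it matched nothing no longer raises IndexError:
print(select_text([], default='(missing)'))   # -> (missing)
```

In `getNewsDetail` you could wrap each field the same way, e.g. `result['title'] = select_text(soup.select('#artibodyTitle'))`, or print `ent['url']` inside `parseListLinks` to find which page lacks `#artibodyTitle` and skip it. Separately, note that `comments.text.strip('var data=')` removes a *set of characters* from both ends rather than the literal prefix; it happens to be harmless here because the JSON starts with `{`, but a plain prefix check (or `removeprefix` on Python 3.9+) is safer.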
小甲鱼最新课程 -> https://ilovefishc.com
OP | Posted on 2018-6-20 09:37:19
up