import requests
import re
import bs4

'''
1. Fetch the index page source
2. Extract the chapter links
3. Fetch each chapter page source
4. Extract the chapter content
5. Download
'''

def get_url():
    url = 'https://www.dukeba.com/947/'
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    res = requests.get(url, headers=headers)
    res.encoding = 'gbk'
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    data = []
    # Collect a (chapter URL, chapter title) pair for every <dd> that holds a link
    for dd in soup.find_all('dd'):
        link = dd.find('a')
        if not link:
            continue
        data.append(('https://www.dukeba.com/947/%s' % link['href'], link.get_text()))
    return data

def content(res):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36'}
    r = requests.get(res, headers=headers)
    r.encoding = 'gbk'
    '''
    soup = bs4.BeautifulSoup(r.text, 'html.parser')
    soup2 = soup.find('div', id='content').get_text()
    soup2 = soup2.replace('\xa0', '')
    return soup2'''
    reg = '<div id="content">(.*?)</div>'
    chapt_content = re.findall(reg, r.text, re.S)
    # Data cleaning does not work yet
    chapt_content = chapt_content[0].replace(' ', '')
    chapt_content = chapt_content.replace('<br />', '')
    return chapt_content

def main():
    for data in get_url():
        res, title = data
        print("Downloading %s" % title)
        with open(r'C:\Users\liuhang\Desktop\os创建文件夹\%s.txt' % title, 'w') as file:
            file.write(content(res))

main()
As the title says: why does this download and save only two chapters of the novel and then fail with the error below?
IndexError: list index out of range
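
The traceback is consistent with `re.findall` returning an empty list for some chapter page, so that `chapt_content[0]` has nothing to index. Why the pattern fails to match on the third request is not something the code above can show; it may be that the server returned a different page (an error or anti-scraping response, for example) or that the `div` carries extra attributes. Below is a minimal sketch of a guard, assuming those are the likely causes. `extract_content` and `CONTENT_RE` are hypothetical names, not part of the original script, and unlike the original the sketch turns `<br>` tags into newlines instead of deleting them.

```python
import re

# Hypothetical helper (not in the original post). The pattern tolerates extra
# attributes on the div, which the exact string '<div id="content">' does not.
CONTENT_RE = re.compile(r'<div[^>]*\bid="content"[^>]*>(.*?)</div>', re.S)

def extract_content(html):
    """Return the cleaned chapter text, or None when no content div is found."""
    match = CONTENT_RE.search(html)
    if match is None:
        # This is the case in which re.findall(...)[0] raises IndexError.
        return None
    text = match.group(1)
    # Drop non-breaking spaces, both as an HTML entity and as a character.
    text = text.replace('&nbsp;', '').replace('\xa0', '')
    # Assumption: line breaks are worth keeping, so <br> becomes a newline.
    text = re.sub(r'<br\s*/?>', '\n', text)
    return text.strip()

# Made-up sample pages, for illustration only.
good = '<div id="content">&nbsp;&nbsp;Chapter text<br />more text</div>'
bad = '<html><body>503 Service Unavailable</body></html>'

print(re.findall('<div id="content">(.*?)</div>', bad, re.S))  # → []
print(extract_content(good))
print(extract_content(bad))  # → None
```

In `main()`, a chapter for which the helper returns `None` could be skipped or retried after a short pause, and printing `r.status_code` and the first part of `r.text` for that request would show what the server actually sent back.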