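Question 1: In soup.find("div","about"), what do the arguments "div" and "about" mean, and why is it written this way? How do these arguments locate the tags that hold the time and the news content?
Question 2: In soup.find("div","about").contents[0][9:].encode('utf-8'), what does [0] stand for, and what does [9:] mean?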
# encoding: utf-8
import re
import time

import requests
from bs4 import BeautifulSoup


class News:
    def __init__(self, title, time, type, content):
        self.title = title      # news title
        self.time = time        # news time
        self.type = type        # news category
        self.content = content  # news content


def getList(url):  # get the article links from one list page
    li = requests.get(url)
    res = r'url":"http:.*?\.html'  # regular expression that matches each link
    urls = re.findall(res, li.text)
    for i in range(len(urls)):
        urls[i] = urls[i][6:]  # drop the leading 'url":"' (6 characters)
    return urls


def getNews(url):  # get the content of one article
    url = url[:-5] + "_0.html"  # rewrite the link to get the full text
    ss = requests.get(url)
    soup = BeautifulSoup(ss.text, "html.parser")  # parse the page; mind the encoding
    title = soup.title.string[:-6].encode('utf-8')
    news_time = soup.find("div", "about").contents[0][9:].encode('utf-8')
    # type = soup.find("div","position lBlue").contents[3].string.encode('utf-8')
    # without [1:-1] the content comes out wrong, so trim the first and last character
    content = soup.find("div", "content").get_text()[1:-1].encode('utf-8')
    print(content.decode())
    # the category lookup above is commented out, so pass None
    # rather than the built-in `type`
    news = News(title, news_time, None, content)
    return news


def saveAsTxt(news):  # append one article to the output file
    with open('E:/news.txt', 'a', encoding='utf-8') as file:
        file.write("Title: " + news.title.decode() +
                   "\tTime: " + news.time.decode() +
                   # "\tCategory: " + news.type +
                   "\tContent: " + news.content.decode() +
                   "\n")


start = time.perf_counter()  # time.clock() was removed in Python 3.8
total = 0
for i in range(1, 40):
    wangzhi = "http://3g.163.com/touch/article/list/BA8J7DG9wangning/%s-40.html" % i
    urls = getList(wangzhi)
    total = total + len(urls)
    # print("Parsed %s items on the current page" % len(urls))
    j = 1
    for url in urls:
        print("Reading page %s, item %s/%s: %s" % (i, j, len(urls), url))
        news = getNews(url)
        saveAsTxt(news)
        j = j + 1
end = time.perf_counter()
print("Crawled %s articles in %f s" % (total, end - start))
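The same kind of slicing appears in getList, where each regex match is cut with [6:]. A minimal sketch of why the number is 6, using a made-up string shaped like the list page's response (the sample text and the names sample, prefix and links are assumptions for illustration, not data from the real site):

```python
import re

# Hypothetical response text; the real list page may be laid out differently.
sample = ('{"title":"a","url":"http://3g.163.com/news/1.html"},'
          '{"title":"b","url":"http://3g.163.com/news/2.html"}')

# Same idea as the pattern in getList, with the dot before "html" escaped.
pattern = r'url":"http:.*?\.html'
matches = re.findall(pattern, sample)

# Every match still starts with the literal text url":" which is 6 characters
# long, so slicing from index 6 leaves only the link itself.
prefix = 'url":"'
links = [m[len(prefix):] for m in matches]

print(len(prefix))
print(links)
```

Writing len(prefix) instead of a bare 6 keeps the slice tied to the text it removes, which is easier to check when the pattern changes.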
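Question 1: "div" is the name of the tag to search for, and "about" is the value of its class attribute, i.e. class='about'. The call therefore returns the first <div class="about"> element in the page.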
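A small sketch of that lookup, assuming a made-up HTML fragment (the markup and variable names below are illustrative only, not the real page). When the second positional argument of find is a string, BeautifulSoup treats it as a CSS class, so the three calls below should select the same element:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real news page has more around it.
html = ('<div class="title">Headline</div>'
        '<div class="about">Updated: 2016-01-01 10:00</div>')
soup = BeautifulSoup(html, "html.parser")

# Positional form used in the question: tag name, then class value.
by_position = soup.find("div", "about")

# Equivalent, more explicit spellings.
by_keyword = soup.find("div", class_="about")
by_attrs = soup.find("div", attrs={"class": "about"})

print(by_position)
print(by_position is by_keyword and by_keyword is by_attrs)
```

The keyword form class_="about" is usually easier to read than the positional one, because it says outright that a class is being matched.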
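Question 2: contents returns the tag's direct children, i.e. every node exactly one level below the tag (child tags and text strings alike). contents[0] is the first of those children. [9:] is an ordinary string slice: it keeps everything from index 9 onward, dropping the first nine characters.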
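The chain can be taken apart step by step. This is only a sketch on an invented fragment: the "Updated: " label is an assumed nine-character prefix standing in for whatever text precedes the time on the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a text node followed by a child tag.
html = '<div class="about">Updated: 2016-01-01 10:00<a href="#">source</a></div>'
soup = BeautifulSoup(html, "html.parser")

about = soup.find("div", "about")

# .contents is a list of the direct children: here one text node and one <a> tag.
children = about.contents

# [0] picks the first child, which is the text node (a str subclass).
first = children[0]

# [9:] slices that string, skipping the 9 characters of "Updated: ".
sliced = first[9:]

# .encode('utf-8') turns the resulting str into bytes.
encoded = sliced.encode('utf-8')

print(children)
print(repr(first))
print(repr(sliced))
print(encoded)
```

If the first child on the real page were a tag rather than text, [9:] would not apply to it directly, so it is worth printing contents once to see what the list actually holds.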