- import urllib.request
- import urllib.parse
- from lxml import etree
- from bs4 import BeautifulSoup
- # Baidu News title search for "平安", 20 results per page
- url="http://news.baidu.com/ns?word=title%3A%28%E5%B9%B3%E5%AE%89%29&pn=0&cl=2&ct=1&tn=newstitle&rn=20&ie=utf-8&bt=0&et=0"
- req=urllib.request.Request(url)
- req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.3368.400 QQBrowser/9.6.11974.400')
- response=urllib.request.urlopen(req).read()
- soup=BeautifulSoup(response,'html.parser')
- # Each search result sits in a <div class="result title">
- soup_list=soup.find_all('div',class_="result title")
- for eat in soup_list:
-     # get_text() joins the text of nested tags such as <em>
-     title=eat.select('a')[0].get_text()
-     link=eat.a.get("href")
-     # second whitespace-separated token of the first inner div
-     ti=eat.select('div')[0].get_text().split()[1]
-     print("%s\n%s\n%s\n\n"%(title,link,ti))
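This is the result extracted with BeautifulSoup:
[Screenshot: BeautifulSoup output]
With XPath's text(), the title comes back in incomplete fragments, because part of the content is inside the em node and part is directly inside the a node.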
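The fragmenting described above can be reproduced on a small snippet. The following is only a sketch: the HTML string is hypothetical markup that imitates a result title with an em node nested inside the a node, and the real Baidu page may be structured differently. It compares `text()`, `.//text()`, and the XPath `string(.)` function in lxml; `string(.)` returns the concatenated text of the element and all its descendants, which is roughly what BeautifulSoup's `get_text()` gives.

```python
from lxml import etree

# Hypothetical markup imitating one search result; not copied from the real page.
html = (
    '<div class="result title"><h3>'
    '<a href="http://example.com/news"><em>平安</em>银行发布年报</a>'
    '</h3></div>'
)
tree = etree.HTML(html)
a = tree.xpath('//div[@class="result title"]//a')[0]

# text() selects only text nodes that are direct children of <a>,
# so the part wrapped in <em> is missing.
fragments = a.xpath('text()')

# .//text() selects text nodes at any depth, but still as separate pieces.
all_text_nodes = a.xpath('.//text()')

# string(.) concatenates the text of the element and all its descendants.
full_title = a.xpath('string(.)')

print(fragments)
print(all_text_nodes)
print(full_title)
```

If the pieces from `.//text()` are needed separately, `''.join(all_text_nodes)` gives the same combined string as `string(.)`.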