This post was last edited by ssqchina on 2023-7-20 14:21
The code below extracts the text from a URL. How should I change it to work on a local HTML file?
import requests
from bs4 import BeautifulSoup
url = "https://www.baidu.com/"
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
# kill all script and style elements
for script in soup(["script", "style"]):
    script.extract()  # rip it out
# get text
text = soup.get_text()
# break into lines and remove leading and trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines into a line each
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines
text = '\n'.join(chunk for chunk in chunks if chunk)
print(text)
from bs4 import BeautifulSoup
with open("f:\\456.html", "r", encoding="utf-8") as f:
    content = f.read()
soup = BeautifulSoup(content, "html.parser")
print(soup.text)
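If the script/style removal and blank-line cleanup from the original script are also wanted for the local file, the two snippets could be combined roughly as follows. This is only a sketch: the function name html_file_to_text is made up for illustration, the file is assumed to be UTF-8 as in the reply above, and splitting on two consecutive spaces is an assumption about the intended separator.

```python
from bs4 import BeautifulSoup


def html_file_to_text(path, encoding="utf-8"):
    """Return the visible text of a local HTML file, one chunk per line.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    with open(path, "r", encoding=encoding) as f:
        soup = BeautifulSoup(f.read(), "html.parser")

    # Remove script and style elements so their contents are not included.
    for tag in soup(["script", "style"]):
        tag.extract()

    # Strip each line, split on runs of two spaces, and drop empty chunks.
    lines = (line.strip() for line in soup.get_text().splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)
```

It would be called as print(html_file_to_text("f:\\456.html")), using the path from the reply.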
Sorry, I didn't take the encoding into account.
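When the encoding of the saved file is not known in advance, one possible approach is to read the raw bytes and try a few candidate encodings in order. The sketch below uses only the standard library. The helper name read_text_with_fallback and the candidate list ("utf-8-sig", then "gb18030") are assumptions, chosen because locally saved Chinese pages are often in a GBK-family encoding; nothing in this thread says which encoding the file actually uses.

```python
def read_text_with_fallback(path, encodings=("utf-8-sig", "gb18030")):
    """Decode a file by trying each encoding in order.

    Hypothetical helper: the name and the default candidates are assumptions.
    """
    with open(path, "rb") as f:
        raw = f.read()

    for enc in encodings:
        try:
            # "utf-8-sig" also accepts plain UTF-8 and drops a leading BOM.
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue

    # Last resort: decode with the first candidate and replace bad bytes.
    return raw.decode(encodings[0], errors="replace")
```

The returned string can then be passed to BeautifulSoup(content, "html.parser") in place of f.read().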