|
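Special cases like this are tedious for rule-based processing too, and non-standard sentences easily lead to ambiguity.
Take the same string, 3:30 p.m.:
in "Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com." it ends the first of two sentences,
while in "At 3:30 p.m. Tom attended a meeting which was held at example.com." it sits inside a single sentence.
My suggestion is to normalize the input, or to use a large language model (ChatGPT 3.0 does well here). [Statistical, probabilistic methods are the ultimate solution for text processing.]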
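To make the ambiguity concrete, here is a minimal sketch of a purely local rule applied to the two examples. The names `PM_BOUNDARY` and `split_after_pm` are hypothetical and are not part of the code in this answer; the rule itself ("p.m." followed by a capitalized word ends a sentence) is only an assumption chosen for illustration.

```python
import re

# Hypothetical rule: treat "p.m." as a sentence end whenever the next word is capitalized.
PM_BOUNDARY = re.compile(r"(?<=p\.m\.)\s+(?=[A-Z])")

def split_after_pm(text):
    """Split text at every "p.m." that is followed by a capitalized word."""
    return PM_BOUNDARY.split(text)

two_sentences = "Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com."
one_sentence = "At 3:30 p.m. Tom attended a meeting which was held at example.com."

print(split_after_pm(two_sentences))  # the split is right for this text
print(split_after_pm(one_sentence))   # the same rule wrongly cuts a single sentence in two
```

Both inputs look identical to the rule around "p.m.", so no rule that only inspects neighbouring characters can get both right; deciding correctly needs information about the whole sentence.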
Below is the code after I fixed a few bugs and had ChatGPT 3.0 polish it. Some problems still remain.
- import re
- # Regular-expression building blocks (raw strings, so \s is not treated as a string escape)
- alphabets = r"([A-Za-z])"
- prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof)[.]"
- suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
- starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
- acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
- websites = r"[.](com|net|org|io|gov|edu|me)"
- digits = r"([0-9])"
- multiple_dots = r'\.{2,}'
- def split_into_sentences(text):
-     """
-     Split the text into sentences.
-     :param text: text to be split into sentences
-     :type text: str
-     :return: list of sentences
-     :rtype: list[str]
-     """
-     # Pad the text and flatten newlines
-     text = " " + text + " "
-     text = text.replace("\n", " ")
-
-     # Protect periods that do not end a sentence: titles, domain names, decimals, ellipses
-     text = re.sub(prefixes, r"\1<prd>", text)
-     text = re.sub(websites, r"<prd>\1", text)
-     text = re.sub(digits + "[.]" + digits, r"\1<prd>\2", text)
-     text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
-     # Special-case abbreviation (Ph.D.)
-     text = text.replace("Ph.D.", "Ph<prd>D<prd>")
-     # A single letter followed by a period and a space
-     text = re.sub(r"\s" + alphabets + "[.] ", r" \1<prd> ", text)
-     # An uppercase acronym followed by a sentence starter ends the sentence
-     text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
-     text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
-     text = re.sub(alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>", text)
-     # Suffixes such as Inc. or Ltd.
-     text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
-     text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
-     text = re.sub(" " + alphabets + "[.]", r" \1<prd>", text)
-     # Move sentence-ending punctuation outside a closing quotation mark
-     text = text.replace(".”", "”.")
-     text = text.replace('."', '".')
-     text = text.replace('!"', '"!')
-     text = text.replace('?"', '"?')
-     # Mark every remaining period, question mark and exclamation mark as a sentence end
-     text = text.replace(".", ".<stop>")
-     text = text.replace("?", "?<stop>")
-     text = text.replace("!", "!<stop>")
-     # Restore the protected periods
-     text = text.replace("<prd>", ".")
-     # Normalize spacing: add a space after commas, drop spaces before punctuation, collapse runs
-     text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)
-     text = re.sub(r'\s+([.?!,;:])', r'\1', text)
-     text = re.sub(r' +', ' ', text)
-     # Break around Chinese punctuation, written as escapes for the full-width forms:
-     # U+FF0C comma, U+3002 full stop, U+FF01 exclamation mark, U+FF1F question mark
-     text = re.sub(r'(\w)([\uFF0C\u3002\uFF01\uFF1F])', r'\1 <stop> \2', text)
-     text = re.sub(r'([\u3002\uFF01\uFF1F])(\w)', r'\1 <stop> \2', text)
-     # Split on the <stop> markers
-     sentences = text.split("<stop>")
-     sentences = [s.strip() for s in sentences]
-
-     # Drop the empty string left after the final <stop>
-     if sentences and not sentences[-1]:
-         sentences = sentences[:-1]
-
-     return sentences
- text = """
- I love you.I am Tom. Hello,world!How are you ?
- Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
- "I love this project!" exclaimed Prof. Brown. "It's amazing."
- The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
- I love you. Tom said.
- I love you.Tom said.
- “苏超”联赛火爆出圈,场均观众8798人远超中甲,门票一票难求!全民参与+城市荣誉模式激活经济内循环,地域热梗成看点,文旅消费激增,政府主导打造“移动的城市广告”。
- 网上热度也是相当狂飙。 虎扑App紧急新增“江苏联”频道,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了,甚至有球迷在二手物品交易平台表示,愿意花100元求购一张门票。
- But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard, that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"
-
-
- """
- for s in split_into_sentences(text):
-     print(s)
Output:
- I love you.
- I am Tom.
- Hello, world!
- How are you?
- Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
- "I love this project"!
- exclaimed Prof. Brown.
- "It's amazing".
- The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
- I love you.
- Tom said.
- I love you.
- Tom said.
- “苏超”联赛火爆出圈
- ,场均观众8798人远超中甲
- ,门票一票难求
- !
- 全民参与+城市荣誉模式激活经济内循环
- ,地域热梗成看点
- ,文旅消费激增
- ,政府主导打造“移动的城市广告”。 网上热度也是相当狂飙
- 。 虎扑App紧急新增“江苏联”频道
- ,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿
- ,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了
- ,甚至有球迷在二手物品交易平台表示
- ,愿意花100元求购一张门票
- 。 But then Wei poured a shower of gold coins into her lap".
- Never mind where I got them", he whispered.
- "Let's just say...
- I made a brilliant business deal"!
- Mei said nothing — she was too busy polishing the gold.
- Now news travels fast.
- Their neighbor, Jin, soon, heard, that Wei had returned from a big business trip and was now rich.
- His wife heard too"?
- Brilliant deal, eh"?
- she said to him.
- "If that fool Wei can make all that money, why can't you"?
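One of the remaining problems visible in the output is that the Chinese text is cut before each punctuation mark, commas included, so the mark lands at the start of the next line. A minimal sketch of one possible alternative follows, assuming that only the full stop, exclamation mark and question mark end a Chinese sentence and that a closing quote directly after them belongs to the same sentence. `CJK_END` and `split_chinese` are hypothetical names, not part of the code above.

```python
import re

# Sentence-final marks: U+3002 full stop, U+FF01 exclamation mark, U+FF1F question mark,
# optionally followed by one closing quote (U+201D or U+300D).
CJK_END = re.compile(r"([\u3002\uFF01\uFF1F]+[\u201D\u300D]?)")

def split_chinese(text):
    """Split Chinese text after sentence-final punctuation, keeping the mark with its sentence."""
    # Put a marker after each run of terminators, then split on the marker.
    marked = CJK_END.sub(r"\1<stop>", text)
    return [s.strip() for s in marked.split("<stop>") if s.strip()]

# Escapes are used for the full-width marks so the sample survives copy and paste.
sample = "门票一票难求\uFF01全民参与\uFF0C地域热梗成看点\u3002"
for sentence in split_chinese(sample):
    print(sentence)
```

Commas are deliberately left out of the pattern, so a clause separated by a full-width comma stays inside its sentence.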
|
|