Anyone interested in tweaking this English sentence splitter to see whether it can be improved? | Python discussion board, 鱼C论坛

blackantt posted on 2025-6-11 23:05:35

Anyone interested in tweaking this English sentence splitter to see whether it can be improved?

# -*- coding: utf-8 -*-
import re

# Each pattern keeps a capture group so the substitutions below can refer to \1, \2, \3.
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list:
    """
    Split the text into sentences.

    If the text contains the substrings "<prd>" or "<stop>", it will be
    split incorrectly, because they are used internally as markers.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list
    """
    text = " " + text + "  "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")

    # text = re.sub(r'([,.?!][\'"»’]*)(\w)', r'\1 \2', text)
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)

    # --- Keep and refine some useful rules from the original function ---

    # Rule 2: remove extra whitespace before punctuation (.?!,;:)
    # Example: "Hello ." -> "Hello."
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)

    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

text = """
I love you.I am Tom. Hello,world!How are you ?
Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
"I love this project!" exclaimed Prof. Brown. "It's amazing."
The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
I love you. Tom said.
I love you.Tom said.
“苏超”联赛火爆出圈，场均观众8798人远超中甲，门票一票难求！全民参与+城市荣誉模式激活经济内循环，地域热梗成看点，文旅消费激增，政府主导打造“移动的城市广告”。
网上热度也是相当狂飙。虎扑App紧急新增“江苏联”频道，上线首日访问量破百万；抖音话题#江苏城市联赛#播放量破亿，素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了，甚至有球迷在二手物品交易平台表示，愿意花100元求购一张门票。
But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard, that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"

"""

print(split_into_sentences(text))
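
The docstring above warns that input containing "<prd>" or "<stop>" breaks the splitting. One possible way around that, sketched here under my own assumptions, is to use marker characters that ordinary text is very unlikely to contain and to reject input that already has them. The names PRD, STOP and split_with_safe_markers are made up for this illustration, and the splitter is deliberately tiny (it only protects decimal points), so it is not a replacement for the function above.

```python
import re

# Assumed marker choice: Unicode private-use characters, which ordinary
# text is very unlikely to contain (unlike the literal "<prd>" / "<stop>").
PRD = "\ue000"   # placeholder for a period that must not end a sentence
STOP = "\ue001"  # placeholder for a sentence boundary


def split_with_safe_markers(text: str) -> list:
    """Deliberately tiny splitter; it only protects decimal points."""
    if PRD in text or STOP in text:
        raise ValueError("input contains a reserved marker character")
    # Protect decimal points such as 3.15.
    text = re.sub(r"([0-9])[.]([0-9])", rf"\1{PRD}\2", text)
    # Mark every remaining ., ? or ! as a sentence boundary.
    text = re.sub(r"([.?!])", rf"\1{STOP}", text)
    # Restore the protected periods, then split on the boundary marker.
    text = text.replace(PRD, ".")
    return [s.strip() for s in text.split(STOP) if s.strip()]


print(split_with_safe_markers("It costs 3.15 now. Tags like <stop> stay intact!"))
# ['It costs 3.15 now.', 'Tags like <stop> stay intact!']
```

The same two constants could in principle replace the "<prd>" and "<stop>" literals in the full function, though I have not tried that on the test text.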

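某一个“天” posted on 2025-6-12 13:05:14

Does it have to be hand-rolled? Can't you use a natural language processing library?

某一个“天” posted on 2025-6-12 13:05:50

If it has to be hand-rolled, I don't have the time lately.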

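blackantt posted on 2025-6-12 19:30:12

某一个“天” posted on 2025-6-12 13:05:
Does it have to be hand-rolled? Can't you use a natural language processing library?

NLTK, spaCy, and blingfire would probably all struggle with this. The splitter has to handle not only normal sentences but also input with missing spaces, such as: I am Tom.I love you. It's 3.15.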
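
One way to narrow the missing-space problem is to normalize the text before it reaches any splitter, hand-rolled or library. The sketch below is only an assumption about what such a step could look like: it inserts a space after `.`, `!` or `?` when a lowercase letter comes right before and an uppercase letter comes right after. The name add_missing_spaces is hypothetical.

```python
import re

# A sentence-ending mark squeezed between a lowercase and an uppercase letter
# is treated as a missing space. Digits on either side (3.15), lowercase
# continuations (example.com) and uppercase acronyms (U.S.A.) are left alone.
_MISSING_SPACE = re.compile(r"(?<=[a-z])([.!?])(?=[A-Z])")


def add_missing_spaces(text: str) -> str:
    """Insert a space after . ! ? where one appears to have been dropped."""
    return _MISSING_SPACE.sub(r"\1 ", text)


print(add_missing_spaces("I am Tom.I love you. It's 3.15."))
# I am Tom. I love you. It's 3.15.
```

This is a heuristic and it has blind spots: a name such as "file.Name" would gain a space, and text without capitalization is not helped at all.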

某一个“天” posted on 2025-6-12 20:38:48

Then put together one of these special strings and I'll test it.

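smallwh posted on 2025-6-27 17:45:04

Special cases like these are troublesome even for rule-based processing. If the sentences are not well-formed, ambiguity arises easily.
Take 3:30 p.m. as an example.
In "Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com." it ends the first of two sentences,
while in "At 3:30 p.m. Tom attended a meeting which was held at example.com." it sits inside a single sentence.
My suggestion is to normalize the input, or to use a large language model (ChatGPT 3.0 does well at this). [Statistical, probability-based methods are the ultimate answer for text processing.]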
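
To make the ambiguity concrete, here is a rough sketch of the kind of rule one might try: split after "a.m." or "p.m." only when the next word is on a list of likely sentence starters. The list LIKELY_STARTERS and the function split_after_time are assumptions for illustration, not part of the code in this thread.

```python
import re

# Hypothetical list of words that often begin a new sentence.
LIKELY_STARTERS = {"The", "He", "She", "It", "They", "We", "But", "However", "This", "That"}


def split_after_time(text: str) -> list:
    """Split after 'a.m.' / 'p.m.' only if the following word is a likely starter."""
    pieces = []
    start = 0
    for match in re.finditer(r"\b[ap][.]m[.]\s+(\w+)", text):
        if match.group(1) in LIKELY_STARTERS:
            boundary = match.start(1)  # the new sentence begins at the next word
            pieces.append(text[start:boundary].strip())
            start = boundary
    pieces.append(text[start:].strip())
    return pieces


print(split_after_time("Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com."))
# ['Dr. Smith and Mr. Jones met at 3:30 p.m.', 'The meeting was held at example.com.']
print(split_after_time("At 3:30 p.m. Tom attended a meeting which was held at example.com."))
# ['At 3:30 p.m. Tom attended a meeting which was held at example.com.']
```

Both examples above come out right, but only by luck of the word list: "At 3:30 p.m. The Beatles arrived." is one sentence, and this rule still cuts it in two after "p.m.".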

Below is the code with some errors fixed and then optimized by ChatGPT 3.0. It still has some problems, though.
import re

# Regular expression patterns (each keeps a capture group for \1, \2, \3 below)
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr|Prof)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text):
    """
    Split the text into sentences.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list
    """
    # Pad the text and flatten newlines
    text = " " + text + "  "
    text = text.replace("\n", " ")

    # Protect periods in prefixes, websites, decimals, and runs of dots
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)

    # Handle specific abbreviations (such as Ph.D.)
    if "Ph.D" in text:
        text = text.replace("Ph.D.", "Ph<prd>D<prd>")

    # Protect a period that follows a single letter
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)

    # Split after an acronym when a sentence starter follows
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)

    # Handle suffixes
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)

    # Move sentence-ending punctuation outside closing quotes
    if "”" in text: text = text.replace(".”", "”.")
    if '"' in text: text = text.replace('."', '".')
    if "!" in text: text = text.replace('!"', '"!')
    if "?" in text: text = text.replace('?"', '"?')

    # Append <stop> after periods, question marks, and exclamation marks
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")

    # Restore the <prd> placeholders to periods
    text = text.replace("<prd>", ".")

    # Fix spacing around other punctuation
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)

    # Handle sentence breaks between Chinese and English text
    text = re.sub(r'(\w)([，。！？])', r'\1 <stop> \2', text)
    text = re.sub(r'([。！？])(\w)', r'\1 <stop> \2', text)

    # Split on <stop>
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]

    # Drop the trailing empty sentence
    if sentences and not sentences[-1]:
        sentences = sentences[:-1]

    return sentences

text = """
I love you.I am Tom.    Hello,world!How are you ?
Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
"I love this project!" exclaimed Prof. Brown. "It's amazing."
The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
I love you. Tom said.
I love you.Tom said.
“苏超”联赛火爆出圈，场均观众8798人远超中甲，门票一票难求！全民参与+城市荣誉模式激活经济内循环，地域热梗成看点，文旅消费激增，政府主导打造“移动的城市广告”。
网上热度也是相当狂飙。虎扑App紧急新增“江苏联”频道，上线首日访问量破百万；抖音话题#江苏城市联赛#播放量破亿，素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了，甚至有球迷在二手物品交易平台表示，愿意花100元求购一张门票。
But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard, that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"

"""

for s in split_into_sentences(text):
    print(s)

Output:
I love you.
I am Tom.
Hello, world!
How are you?
Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
"I love this project"!
exclaimed Prof. Brown.
"It's amazing".
The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
I love you.
Tom said.
I love you.
Tom said.
“苏超”联赛火爆出圈
，场均观众8798人远超中甲
，门票一票难求
！
全民参与+城市荣誉模式激活经济内循环
，地域热梗成看点
，文旅消费激增
，政府主导打造“移动的城市广告”。网上热度也是相当狂飙
。虎扑App紧急新增“江苏联”频道
，上线首日访问量破百万；抖音话题#江苏城市联赛#播放量破亿
，素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了
，甚至有球迷在二手物品交易平台表示
，愿意花100元求购一张门票
。 But then Wei poured a shower of gold coins into her lap".
Never mind where I got them", he whispered.
"Let's just say...
I made a brilliant business deal"!
Mei said nothing — she was too busy polishing the gold.
Now news travels fast.
Their neighbor, Jin, soon, heard, that Wei had returned from a big business trip and was now rich.
His wife heard too"?
Brilliant deal, eh"?
she said to him.
"If that fool Wei can make all that money, why can't you"?

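In the output above, closing quotes end up before the final punctuation (for example `"I love this project"!`), because the function swaps them before marking sentence boundaries. A possible post-processing step, sketched here as an assumption rather than a tested fix, is to swap them back at the end of each sentence. The name restore_quote_order is hypothetical.

```python
import re

# A closing quote directly followed by the sentence-final mark, at the very
# end of the sentence.
_QUOTE_THEN_PUNCT = re.compile(r'(["”])([.!?])$')


def restore_quote_order(sentence: str) -> str:
    """Move a final . ! ? back inside the closing quote that precedes it."""
    return _QUOTE_THEN_PUNCT.sub(r"\2\1", sentence)


for s in ['"I love this project"!', '"It\'s amazing".', 'Hello, world!']:
    print(restore_quote_order(s))
# "I love this project!"
# "It's amazing."
# Hello, world!
```

It could be applied as `[restore_quote_order(s) for s in split_into_sentences(text)]`. Two caveats: the swap inside the function is lossy, so text that really was written as `".` gets changed too, and a line such as `... into her lap".` still keeps a quote that actually opens the next sentence.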