blackantt posted on 2025-6-11 23:05:35

Is anyone interested in tweaking this English sentence splitter to see whether it can be improved?

# -*- coding: utf-8 -*-
import re
# Each pattern is a capture group that the substitutions below refer to as \1, \2, ...
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list
    """
    text = " " + text + ""
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")




    # text = re.sub(r'([,.?!][\'"»’]*)(\w)', r'\1 \2', text)
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)

    # --- Keep and refine some useful rules from the original function ---

    # Rule 2: remove extra whitespace before punctuation (.?!,;:)
    # Example: "Hello ." -> "Hello."
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)




    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences


text = """
I love you.I am Tom.      Hello,world!How are you ?
    Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
    "I love this project!" exclaimed Prof. Brown. "It's amazing."
    The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
    I love you. Tom said.
    I love you.Tom said.
    “苏超”联赛火爆出圈,场均观众8798人远超中甲,门票一票难求!全民参与+城市荣誉模式激活经济内循环,地域热梗成看点,文旅消费激增,政府主导打造“移动的城市广告”。
    网上热度也是相当狂飙。虎扑App紧急新增“江苏联”频道,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了,甚至有球迷在二手物品交易平台表示,愿意花100元求购一张门票。
    But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard,    that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"



"""

print(split_into_sentences(text))
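
The docstring above notes that input containing "<prd>" or "<stop>" would be split incorrectly. One possible workaround, sketched here as an assumption rather than as part of the original function, is to pick marker characters that do not occur in the input. `pick_markers` is a hypothetical helper name, and using the Unicode Private Use Area is only one way to do it.

```python
def pick_markers(text, count=2):
    """Return `count` distinct private-use characters that do not occur in `text`."""
    markers = []
    # U+E000..U+F8FF is the Unicode Private Use Area; ordinary text rarely contains it.
    for codepoint in range(0xE000, 0xF900):
        ch = chr(codepoint)
        if ch not in text:
            markers.append(ch)
            if len(markers) == count:
                return markers
    raise ValueError("no unused private-use characters available")


# Hypothetical usage: these two characters would stand in for "<prd>" and "<stop>".
prd_marker, stop_marker = pick_markers("Literal <stop> in the input. Next sentence.")
```

The splitter would then need to take the two markers as parameters instead of hard-coding the strings.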

某一个“天” posted on 2025-6-12 13:05:14

Does it have to be hand-rolled? Can an NLP library be used?

某一个“天” posted on 2025-6-12 13:05:50

If it has to be hand-rolled, I don't have time lately.

blackantt posted on 2025-6-12 19:30:12

某一个“天” posted on 2025-6-12 13:05
Does it have to be hand-rolled? Can an NLP library be used?

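NLTK, spaCy, and blingfire would probably all struggle. It is not enough to handle normal sentences; cases with missing spaces, such as "I am Tom.I love you. It's 3.15.", also have to be considered.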
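
A minimal sketch of one way to deal with the missing-space case before any splitter sees the text: insert a space after `.`, `?`, or `!` when it follows two lowercase letters and is followed directly by an uppercase letter. The function name `add_missing_spaces` and the two-lowercase-letter rule are assumptions made for illustration; the rule leaves decimals such as 3.15 and abbreviations such as Ph.D. or U.S. alone, but it will not catch every case (for example a sentence ending in a one-letter word).

```python
import re

# Two lowercase letters, a sentence terminator, then an uppercase letter:
# "Tom.I" -> "Tom. I". Digits ("3.15") and capitalized abbreviations
# ("U.S.", "Ph.D.") do not satisfy the lookbehind, so they are untouched.
_MISSING_SPACE = re.compile(r"(?<=[a-z]{2}[.?!])(?=[A-Z])")


def add_missing_spaces(text):
    """Insert a space at sentence boundaries where the writer omitted it."""
    return _MISSING_SPACE.sub(" ", text)


print(add_missing_spaces("I am Tom.I love you. It's 3.15."))
```

The normalized string could then be handed to the hand-written splitter or to a library tokenizer; whether that is enough for a given library would need to be tested.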

某一个“天” posted on 2025-6-12 20:38:48

Then put together a string with these special cases and I'll test it.

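smallwh posted 6 days ago

Special cases like these are troublesome for rule-based processing too. If the sentences are not well-formed, ambiguity arises easily.
Take the same "3:30 p.m." as an example:
in "Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com." there are two sentences,
while in "At 3:30 p.m. Tom attended a meeting which was held at example.com." there is only one.
My suggestion is to normalize the input, or to use a large language model (ChatGPT 3.0 does well at this). [Statistical and probabilistic methods are the ultimate answer for text processing.]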
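
To make the ambiguity concrete, here is a small sketch of a rule that happens to get both example sentences right: split after "a.m." or "p.m." only when the next word is in a list of likely sentence starters. The `STARTERS` list and the function name `split_after_time_abbrev` are assumptions chosen for this illustration, not part of the code in this thread.

```python
import re

# Hypothetical list of words that commonly begin a new sentence.
STARTERS = ("The", "He", "She", "It", "They", "We", "But", "However", "This", "That")

# Whitespace that follows "a.m." / "p.m." and precedes one of the starter words.
_TIME_ABBREV_BOUNDARY = re.compile(
    r"(?<=[ap]\.m\.)\s+(?=(?:" + "|".join(STARTERS) + r")\b)"
)


def split_after_time_abbrev(text):
    """Split after a.m./p.m. only when a likely sentence starter follows."""
    return _TIME_ABBREV_BOUNDARY.split(text)


print(split_after_time_abbrev(
    "Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com."))
print(split_after_time_abbrev(
    "At 3:30 p.m. Tom attended a meeting which was held at example.com."))
```

The rule remains a guess: a sentence that really does end in "p.m." and is followed by one that opens with a name would be left unsplit, and "At 3:30 p.m. The Times published the story" would be split by mistake.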


Below is the code with some errors fixed and then optimized by ChatGPT 3.0. Some problems still remain.
import re

# Define the regular expression patterns (capture groups referenced as \1, \2, ... below)
alphabets = "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr|Prof)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text):
    """
    Split the text into sentences.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list
    """
    # Initialize the text
    text = " " + text + ""
    text = text.replace("\n", " ")

    # Handle abbreviations
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)

    # Handle specific abbreviations (such as Ph.D.)
    if "Ph.D" in text:
        text = text.replace("Ph.D.", "Ph<prd>D<prd>")

    # Handle the period after a single letter
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)

    # Split sentences after acronyms
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)

    # Handle suffixes
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)

    # Handle quotation marks next to punctuation
    if "”" in text: text = text.replace(".”", "”.")
    if '\"' in text: text = text.replace('."', '".')
    if "!" in text: text = text.replace('!"', '"!')
    if "?" in text: text = text.replace('?"', '"?')

    # Append <stop> after periods, question marks, and exclamation marks
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")

    # Restore the protected <prd> markers to periods
    text = text.replace("<prd>", ".")

    # Fix spacing around other punctuation
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)

    # Handle sentence breaks between Chinese and English text
    text = re.sub(r'(\w)([,。!?])', r'\1 <stop> \2', text)
    text = re.sub(r'([。!?])(\w)', r'\1 <stop> \2', text)

    # Split into sentences on <stop>
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]

    # Drop the trailing empty sentence
    if sentences and not sentences[-1]:
        sentences = sentences[:-1]
   
    return sentences



text = """
I love you.I am Tom.      Hello,world!How are you ?
    Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
    "I love this project!" exclaimed Prof. Brown. "It's amazing."
    The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
    I love you. Tom said.
    I love you.Tom said.
    “苏超”联赛火爆出圈,场均观众8798人远超中甲,门票一票难求!全民参与+城市荣誉模式激活经济内循环,地域热梗成看点,文旅消费激增,政府主导打造“移动的城市广告”。
    网上热度也是相当狂飙。虎扑App紧急新增“江苏联”频道,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了,甚至有球迷在二手物品交易平台表示,愿意花100元求购一张门票。
    But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard,    that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"



"""

for s in split_into_sentences(text):
    print(s)


Output:
I love you.
I am Tom.
Hello, world!
How are you?
Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
"I love this project"!
exclaimed Prof. Brown.
"It's amazing".
The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
I love you.
Tom said.
I love you.
Tom said.
“苏超”联赛火爆出圈
,场均观众8798人远超中甲
,门票一票难求

全民参与+城市荣誉模式激活经济内循环
,地域热梗成看点
,文旅消费激增
,政府主导打造“移动的城市广告”。 网上热度也是相当狂飙
。 虎扑App紧急新增“江苏联”频道
,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿
,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了
,甚至有球迷在二手物品交易平台表示
,愿意花100元求购一张门票
。 But then Wei poured a shower of gold coins into her lap".
Never mind where I got them", he whispered.
"Let's just say...
I made a brilliant business deal"!
Mei said nothing — she was too busy polishing the gold.
Now news travels fast.
Their neighbor, Jin, soon, heard, that Wei had returned from a big business trip and was now rich.
His wife heard too"?
Brilliant deal, eh"?
she said to him.
"If that fool Wei can make all that money, why can't you"?
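
In the output above, the Chinese paragraphs are cut at every comma and the punctuation mark ends up at the start of the next fragment. A minimal sketch of an alternative for the Chinese part, assuming the text uses full-width terminators and that only sentence-final marks (not commas) should end a sentence; `split_cjk_sentences` is a hypothetical name, and the punctuation is written as escapes so it survives copy and paste:

```python
import re

# Sentence-final marks: ideographic full stop, full-width exclamation mark,
# full-width question mark, horizontal ellipsis.
_CJK_END = "\u3002\uff01\uff1f\u2026"
# Closing quotes and brackets that may directly follow a terminator.
_CLOSERS = "\u201d\u2019\u300d\u300f\uff09"

# A run of terminators plus any closing quotes is kept with the sentence it ends.
_CJK_BOUNDARY = re.compile("([" + _CJK_END + "]+[" + _CLOSERS + "]*)")


def split_cjk_sentences(text):
    """Split after full-width sentence terminators, leaving commas alone."""
    marked = _CJK_BOUNDARY.sub(r"\1<stop>", text)
    return [s.strip() for s in marked.split("<stop>") if s.strip()]
```

This only covers the Chinese side; it would still have to be combined with the English rules, for example by running it on the fragments the English splitter returns.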