FishC Forum


Anyone interested in tweaking this English sentence splitter to see whether it can be improved?

Posted 2025-6-11 23:05:35

# -*- coding: utf-8 -*-
import re

# Raw strings keep regex escapes such as \s from being read as string escapes.
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r"\.{2,}"


def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    # Move terminal punctuation outside a closing quote so the split lands after the quote.
    if "”" in text: text = text.replace(".”", "”.")
    if '"' in text: text = text.replace('."', '".')
    if "!" in text: text = text.replace('!"', '"!')
    if "?" in text: text = text.replace('?"', '"?')
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")

    # text = re.sub(r'([,.?!][\'"»’]*)(\w)', r'\1 \2', text)
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)

    # --- Keep and refine some useful rules from the original function ---

    # Rule 2: remove extra whitespace before punctuation (.?!,;:)
    # Example: "Hello ." -> "Hello."
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)

    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences


text = """
I love you.I am Tom.      Hello,world!How are you ?
    Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
    "I love this project!" exclaimed Prof. Brown. "It's amazing."
    The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
    I love you. Tom said.
    I love you.Tom said.
    “苏超”联赛火爆出圈,场均观众8798人远超中甲,门票一票难求!全民参与+城市荣誉模式激活经济内循环,地域热梗成看点,文旅消费激增,政府主导打造“移动的城市广告”。
    网上热度也是相当狂飙。  虎扑App紧急新增“江苏联”频道,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了,甚至有球迷在二手物品交易平台表示,愿意花100元求购一张门票。
    But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard,    that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"
"""

print(split_into_sentences(text))
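The docstring above warns that input containing "<prd>" or "<stop>" would be split incorrectly. A minimal sketch of the same placeholder technique that sidesteps the collision, assuming private-use Unicode characters as sentinels. The names PRD, STOP, ABBREVIATIONS, and split_with_sentinels are made up for illustration and are not part of the code above, and the rule set is deliberately tiny:

```python
import re

# Hypothetical sentinels from the Unicode private-use area; unlike "<prd>"
# and "<stop>", they are very unlikely to occur in ordinary input text.
PRD = "\ue000"
STOP = "\ue001"

# Illustrative subset of abbreviations only, not a complete list.
ABBREVIATIONS = r"\b(Mr|Mrs|Ms|Dr|Prof)\."


def split_with_sentinels(text: str) -> list[str]:
    """Split on . ? ! while protecting a few abbreviations and decimals."""
    # Step 1: hide periods that do not end a sentence.
    text = re.sub(ABBREVIATIONS, lambda m: m.group(1) + PRD, text)
    text = re.sub(r"(\d)\.(\d)", lambda m: m.group(1) + PRD + m.group(2), text)
    # Step 2: mark every remaining terminator as a sentence boundary.
    text = re.sub(r"([.?!])", lambda m: m.group(1) + STOP, text)
    # Step 3: restore the hidden periods, then cut at the boundaries.
    text = text.replace(PRD, ".")
    return [s.strip() for s in text.split(STOP) if s.strip()]


print(split_with_sentinels("Dr. Smith paid 2.5 dollars. The tag <stop> is kept."))
```

With these sentinels the literal text "<stop>" passes through untouched, because only the private-use character is treated as a boundary.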
Latest courses from 小甲鱼 -> https://ilovefishc.com

Posted 2025-6-12 13:05:14
Does it have to be hand-rolled? Can an NLP library be used?

Posted 2025-6-12 13:05:50
If it has to be hand-rolled, I haven't had the time lately.

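Original poster | Posted 2025-6-12 19:30:12
某一个“天” posted on 2025-6-12 13:05:
Does it have to be hand-rolled? Can an NLP library be used?

NLTK, spaCy, and blingfire would probably all struggle here. The splitter has to handle not only normal sentences but also input with missing spaces, such as: I am Tom.I love you. It's 3.15.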
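One possible workaround for the missing-space case is to repair the spacing before handing the text to any splitter, hand-rolled or library-based. A rough sketch, assuming that a lowercase letter, then a terminator, then an uppercase letter always marks a missing space. The function name add_missing_spaces is hypothetical, and the heuristic will misfire on unusual input such as mixed-case identifiers:

```python
import re


def add_missing_spaces(text: str) -> str:
    """Insert a space after . ? ! when an uppercase letter follows directly."""
    # The lookbehind requires a lowercase letter before the terminator, so
    # decimals ("3.15") and dotted acronyms ("U.S.") are left alone.
    # The lookahead requires an uppercase letter after it, so domains such
    # as "example.com" are left alone too.
    return re.sub(r"(?<=[a-z])([.?!])(?=[A-Z])", r"\1 ", text)


print(add_missing_spaces("I am Tom.I love you. It's 3.15."))
# prints: I am Tom. I love you. It's 3.15.
```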

Posted 2025-6-12 20:38:48
Then put together a special string like that and I'll test it.

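Posted 6 days ago
For special cases like these, rule-based processing is also troublesome. When the sentences are not well-formed, ambiguity arises easily.
Take "3:30 p.m." as an example.
"Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com." is two sentences,
whereas "At 3:30 p.m. Tom attended a meeting which was held at example.com." is one sentence.
My suggestion is to normalize the input, or to use a large language model (ChatGPT 3.0 performs well at this). [Statistical, probability-based methods are the ultimate solution for text processing.]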
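To make the ambiguity concrete, here is a small sketch with one made-up rule: "a.m." or "p.m." followed by a space and a capital letter ends a sentence. The pattern and the function name rule_says_boundary are hypothetical. The rule sees exactly the same local context in both examples, so it cannot give the right answer for both:

```python
import re

# Hypothetical rule: "a.m." or "p.m." followed by a space and a capital
# letter is treated as the end of a sentence.
BOUNDARY_AFTER_TIME = re.compile(r"\b[ap]\.m\.(?= [A-Z])")


def rule_says_boundary(text: str) -> bool:
    """Return True if the rule would split the text after 'a.m.' or 'p.m.'."""
    return BOUNDARY_AFTER_TIME.search(text) is not None


two_sentences = "Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com."
one_sentence = "At 3:30 p.m. Tom attended a meeting which was held at example.com."

print(rule_says_boundary(two_sentences))  # True, and a split is wanted here
print(rule_says_boundary(one_sentence))   # True as well, but no split is wanted
```

Telling the two apart needs information beyond the characters around the period, such as whether the text before "p.m." already forms a complete clause.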


Below is the code with some errors fixed and then optimized by ChatGPT 3.0. It still has some problems, though.
import re

# Regular-expression patterns used by the splitter
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr|Prof)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r"\.{2,}"


def split_into_sentences(text):
    """
    Split the text into sentences.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    # Pad the text and flatten newlines
    text = " " + text + "  "
    text = text.replace("\n", " ")

    # Protect abbreviations, domains, decimals, and ellipses
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)

    # Handle specific abbreviations (e.g. Ph.D.)
    if "Ph.D" in text:
        text = text.replace("Ph.D.", "Ph<prd>D<prd>")

    # Handle a period after a single letter
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)

    # Handle sentence breaks after acronyms
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)

    # Handle suffixes
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)

    # Handle quotes next to terminal punctuation
    if "”" in text: text = text.replace(".”", "”.")
    if '"' in text: text = text.replace('."', '".')
    if "!" in text: text = text.replace('!"', '"!')
    if "?" in text: text = text.replace('?"', '"?')

    # Append <stop> after periods, question marks, and exclamation marks
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")

    # Restore <prd> placeholders to periods
    text = text.replace("<prd>", ".")

    # Fix spacing around other punctuation
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)

    # Handle sentence breaks between Chinese and English
    # (full-width punctuation, so ASCII commas in English text are unaffected)
    text = re.sub(r'(\w)([,。!?])', r'\1 <stop> \2', text)
    text = re.sub(r'([。!?])(\w)', r'\1 <stop> \2', text)

    # Split on <stop>
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]

    # Drop a trailing empty sentence
    if sentences and not sentences[-1]:
        sentences = sentences[:-1]

    return sentences


text = """
I love you.I am Tom.      Hello,world!How are you ?
    Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
    "I love this project!" exclaimed Prof. Brown. "It's amazing."
    The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
    I love you. Tom said.
    I love you.Tom said.
    “苏超”联赛火爆出圈,场均观众8798人远超中甲,门票一票难求!全民参与+城市荣誉模式激活经济内循环,地域热梗成看点,文旅消费激增,政府主导打造“移动的城市广告”。
    网上热度也是相当狂飙。  虎扑App紧急新增“江苏联”频道,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了,甚至有球迷在二手物品交易平台表示,愿意花100元求购一张门票。
    But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard,    that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"
"""

for s in split_into_sentences(text):
    print(s)


Output:
I love you.
I am Tom.
Hello, world!
How are you?
Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
"I love this project"!
exclaimed Prof. Brown.
"It's amazing".
The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
I love you.
Tom said.
I love you.
Tom said.
“苏超”联赛火爆出圈
,场均观众8798人远超中甲
,门票一票难求

全民参与+城市荣誉模式激活经济内循环
,地域热梗成看点
,文旅消费激增
,政府主导打造“移动的城市广告”。 网上热度也是相当狂飙
。 虎扑App紧急新增“江苏联”频道
,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿
,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了
,甚至有球迷在二手物品交易平台表示
,愿意花100元求购一张门票
。 But then Wei poured a shower of gold coins into her lap".
Never mind where I got them", he whispered.
"Let's just say...
I made a brilliant business deal"!
Mei said nothing — she was too busy polishing the gold.
Now news travels fast.
Their neighbor, Jin, soon, heard, that Wei had returned from a big business trip and was now rich.
His wife heard too"?
Brilliant deal, eh"?
she said to him.
"If that fool Wei can make all that money, why can't you"?
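In the output above, the Chinese text is cut at every comma and each punctuation mark ends up at the start of the next fragment. One possible alternative for the Chinese part, sketched under the assumption that only 。!? end a sentence: split at the position right after the terminator, so the mark stays with its sentence. The names CJK_TERMINATORS and split_chinese_sentences are hypothetical, and a closing quote that follows a terminator would still be detached:

```python
import re

# Sentence-final punctuation for Chinese text; commas are deliberately left
# out because they separate clauses, not sentences.
CJK_TERMINATORS = "。!?"


def split_chinese_sentences(text: str) -> list[str]:
    """Split after 。!? so that each terminator stays with its sentence."""
    # The lookbehind matches the position right after a terminator; any
    # whitespace that follows is consumed by the split. Splitting on a
    # possibly empty match needs Python 3.7 or later.
    parts = re.split(rf"(?<=[{CJK_TERMINATORS}])\s*", text)
    return [p.strip() for p in parts if p.strip()]


sample = "门票一票难求!全民参与,文旅消费激增。  网上热度也是相当狂飙。"
for sentence in split_chinese_sentences(sample):
    print(sentence)
```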
GMT+8, 2025-7-3 20:51