FishC Forum


Is anyone interested in tweaking this English sentence splitter to see whether it can be improved?

Posted the day before yesterday at 23:05

# -*- coding: utf-8 -*-
import re

# Regex building blocks used by the splitting rules (raw strings avoid invalid-escape warnings)
alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r"\.{2,}"

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains the substrings "<prd>" or "<stop>", it will be
    split incorrectly, because they are used internally as markers.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n", " ")
    # Protect periods that do not end a sentence by replacing them with <prd>
    text = re.sub(prefixes, r"\1<prd>", text)
    text = re.sub(websites, r"<prd>\1", text)
    text = re.sub(digits + "[.]" + digits, r"\1<prd>\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", r" \1<prd> ", text)
    text = re.sub(acronyms + " " + starters, r"\1<stop> \2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", r"\1<prd>\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, r" \1<stop> \2", text)
    text = re.sub(" " + suffixes + "[.]", r" \1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", r" \1<prd>", text)
    # Move sentence-ending punctuation outside a closing quote
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    # Mark the remaining sentence ends, then restore the protected periods
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")




    # text = re.sub(r'([,.?!][\'"»’]*)(\w)', r'\1 \2', text)
    # Insert a space after a comma (and any closing quotes) when a word character follows directly
    text = re.sub(r'([,][\'"»’]*)(\w)', r'\1 \2', text)

    # --- Keep and refine some useful rules from the original function ---

    # Rule 2: remove extra whitespace before punctuation (.?!,;:)
    # Example: "Hello ." -> "Hello."
    text = re.sub(r'\s+([.?!,;:])', r'\1', text)
    text = re.sub(r' +', ' ', text)




    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences


text = """
I love you.I am Tom.      Hello,world!How are you ?
    Dr. Smith and Mr. Jones met at 3:30 p.m. The meeting was held at example.com.
    "I love this project!" exclaimed Prof. Brown. "It's amazing."
    The U.S. Department of Energy (DOE) reported a 2.5% increase in 3.15 at Beijing.
    I love you. Tom said.
    I love you.Tom said.
    “苏超”联赛火爆出圈,场均观众8798人远超中甲,门票一票难求!全民参与+城市荣誉模式激活经济内循环,地域热梗成看点,文旅消费激增,政府主导打造“移动的城市广告”。
    网上热度也是相当狂飙。  虎扑App紧急新增“江苏联”频道,上线首日访问量破百万;抖音话题#江苏城市联赛#播放量破亿,素人拍摄的赛事短视频占比超70%……第三轮的门票已经是花钱都抢不到了,甚至有球迷在二手物品交易平台表示,愿意花100元求购一张门票。
    But then Wei poured a shower of gold coins into her lap."Never mind where I got them", he whispered. "Let's just say... I made a brilliant business deal!"Mei said nothing — she was too busy polishing the gold.Now news travels fast.Their neighbor,Jin ,soon, heard,    that Wei had returned from a big business trip and was now rich. His wife heard too?"Brilliant deal, eh?" she said to him. "If that fool Wei can make all that money,why can't you?"
"""

print(split_into_sentences(text))
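
The docstring above notes that input containing the literal substrings "<prd>" or "<stop>" is split incorrectly. One possible workaround, sketched here as an assumption and not as part of the posted function, is to pick marker characters from the Unicode Private Use Area that do not occur in the input. The helper name pick_markers is hypothetical.

```python
def pick_markers(text: str, count: int = 2) -> list[str]:
    """Return `count` distinct single-character markers that do not occur in `text`.

    Hypothetical helper: candidates come from the Unicode Private Use Area
    (U+E000..U+F8FF), which has no assigned meaning in ordinary text.
    """
    markers: list[str] = []
    for code_point in range(0xE000, 0xF900):
        candidate = chr(code_point)
        if candidate not in text:
            markers.append(candidate)
            if len(markers) == count:
                return markers
    raise ValueError("no unused private-use characters available")


# Example: the input itself contains "<stop>", so a fixed "<stop>" marker would collide.
prd, stop = pick_markers("This sentence mentions <stop> literally.")
print(len(prd), len(stop), prd != stop)
```

The two returned characters could then stand in for "<prd>" and "<stop>" inside split_into_sentences, at the cost of passing them through every replace call.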
小甲鱼's latest courses -> https://ilovefishc.com

Posted yesterday at 13:05
Does it have to be hand-written? Can a natural language processing library be used?

Posted yesterday at 13:05
If it has to be hand-written, I don't have the time lately.

Original poster | Posted yesterday at 19:30
某一个“天” wrote on 2025-6-12 13:05:
Does it have to be hand-written? Can a natural language processing library be used?


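NLTK, spaCy, and blingfire would probably all struggle with this. The splitter has to handle not only normal sentences but also input with missing spaces, such as: I am Tom.I love you. It's 3.15.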
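
As a rough illustration of the missing-space case, the sketch below inserts a space after ".", "!" or "?" when the mark follows two lowercase letters and is immediately followed by an uppercase letter. This is a heuristic written for this example, not something from the posted code: the name add_missing_spaces is hypothetical, and the two-lowercase-letter condition is an assumption chosen so that decimals such as 3.15, domains such as example.com, and abbreviations such as U.S. or Ph.D. are left alone.

```python
import re

# Hypothetical pre-processing step, not part of the original post.
# (?<=[a-z]{2})  two lowercase letters before the mark (skips "U.S." and "Ph.D.")
# ([.!?])        the sentence-ending mark itself
# (?=[A-Z])      an uppercase letter directly after it (skips "3.15" and "example.com")
_MISSING_SPACE = re.compile(r"(?<=[a-z]{2})([.!?])(?=[A-Z])")


def add_missing_spaces(text: str) -> str:
    """Insert a space after a sentence-ending mark that is glued to the next sentence."""
    return _MISSING_SPACE.sub(r"\1 ", text)


sample = "I am Tom.I love you. It's 3.15."
print(add_missing_spaces(sample))  # -> I am Tom. I love you. It's 3.15.
```

Text normalized this way could then be passed to a hand-written splitter or to a library; whether a given library copes with the result would still need to be tested. The heuristic misses sentences that end in a single lowercase letter or a digit, and sentences that start with a lowercase letter.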

Posted yesterday at 20:38
Then put together a special string like that and I'll test it.

GMT+8, 2025-6-13 09:43
