[Solved] Why isn't fasttext working? The results are wildly wrong...

blackantt · posted on 2023-5-13 10:00:41

Last edited by blackantt on 2023-5-13 10:06

en.zip (17.18 KB, downloads: 2)

import fasttext
from blingfire import text_to_sentences

# Suppress the warning: `load_model` does not return WordVectorModel or SupervisedModel any more, but a `FastText`
fasttext.FastText.eprint = lambda x: None
# Load the pre-trained model
model = fasttext.load_model("./lid.176.bin")

txt = """

"""

txt1 = """

"""

# print(txt)
txt = txt.replace('\n', ' ').replace('“', ' ').replace('”', ' ').replace('？', ' ').replace('‘', " ").replace('’', " ").replace('！', ' ').replace('。', ' ').replace('…', ' ').replace('. . .', '').replace('...','...').replace('–','-')
txt1 = txt1.replace('\n', ' ').replace('“', ' ').replace('”', ' ').replace('？', ' ').replace('‘', " ").replace('’', " ").replace('！', ' ').replace('。', ' ').replace('…', ' ').replace('. . .', '').replace('...','...').replace('–','-')

language_detected1 = model.predict(txt1, k=1)
if language_detected1[0][0] == '__label__en':      # only English sentences go into the sentence list; change this to extract other languages
    print('--------------------------------------------- en')      # got the single-page text
elif language_detected1[0][0] == '__label__zh':    # skip other languages
    print('********************************************* cn')
else:                                              # skip other languages
    print('unknown, maybe Chinese punctuation was used in English text')

Best answer

陶远航

2023-5-13 10:44:13

First, confirm that you have installed the fasttext and blingfire libraries correctly. If not, run the following command to install them:

pip install fasttext blingfire

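Make sure your lid.176.bin file exists in the specified directory. If the file cannot be found, download it from here.

I tidied up the formatting of the code and added some sample input text: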

import fasttext

# Suppress warning messages
fasttext.FastText.eprint = lambda x: None

# Load the pre-trained model
model = fasttext.load_model("./lid.176.bin")

txt = """
This is a test sentence in English.
"""

txt1 = """
这是一段中文测试句子。
"""

def clean_text(text):
    replacements = {
        "\n": " ", "“": " ", "”": " ", "？": " ", "‘": " ",
        "’": " ", "！": " ", "。": " ", "…": " ", ". . .": "",
        "...": "...", "–": "-"
    }
    for old, new_ in replacements.items():
        text = text.replace(old, new_)
    return text

txt = clean_text(txt)
txt1 = clean_text(txt1)

language_detected1 = model.predict(txt1, k=1)

if language_detected1[0][0] == '__label__en':
    print('--------------------------------------------- en')
elif language_detected1[0][0] == '__label__zh':
    print('********************************************* cn')
else:
    print('unknown, maybe Chinese punctuation was used in English text')

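Note that only txt1 is run through language detection here. If you want to check more texts, add them in the appropriate place and adjust the code accordingly. For automatic detection and output, you can wrap the code in a function as needed.

If your code still has any errors or anything that looks wrong, please let me know.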

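isdkz · posted on 2023-5-13 10:00:52

Sorry, your question doesn't provide enough information or code for me to understand it. Please provide more details and code so that I can help you solve the problem.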

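陶远航 · posted on 2023-5-13 10:01:12

isdkz posted on 2023-5-13 10:00
Sorry, your question doesn't provide enough information or code for me to understand it. Please provide more details and code so that I can help ...

Really...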

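blackantt · posted on 2023-5-13 10:01:39

isdkz posted on 2023-5-13 10:00
Sorry, your question doesn't provide enough information or code for me to understand it. Please provide more details and code so that I can help ...

The forum says the text contains banned words, so I can't post it. I'm attaching it as a zip file instead.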

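陶远航 · posted on 2023-5-13 10:04:54

blackantt posted on 2023-5-13 10:01
The forum says the text contains banned words, so I can't post it. I'm attaching it as a zip file instead.

The error may be caused by not choosing a suitable training set when training the model for FastText text classification, or by having too little training data. You also need to preprocess the input text, for example by removing punctuation and stop words, to ensure the quality of the input data. I suggest checking the dataset and the preprocessing steps, and whether the training parameters are set sensibly.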

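blackantt · posted on 2023-5-13 10:07:30

陶远航 posted on 2023-5-13 10:04
The error may be caused by not choosing a suitable training set when training the model for FastText text classification, or by the training set ...

I've already done everything I should. Take a look at the py file in my zip package.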

陶远航 · posted on 2023-5-13 10:18:04

blackantt posted on 2023-5-13 10:07
I've already done everything I should. Take a look at the py file in my zip package.

Paste the error message here.

blackantt · posted on 2023-5-13 10:19:43

Last edited by blackantt on 2023-5-13 10:20

陶远航 posted on 2023-5-13 10:18
Paste the error message here.

There is no error message. It says that a sentence like "Rosenstockcén fáth a bhfuil tú ag stánadh?éirigh as ...scanróidh tú an ghéwhy in some cases republish it, but only in accordance with the terms 微软必应搜索引擎已经全面升级" is English.

陶远航 · posted on 2023-5-13 10:45:23

If it still doesn't work, try loading the model this way:

model = fasttext.load_model('lid.176.bin', encoding='utf-8')

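blackantt · posted on 2023-5-13 11:04:38

陶远航 posted on 2023-5-13 10:44
First, confirm that you have installed the fasttext and blingfire libraries correctly. If not, run the following command to install them:

Make sure your lid.176.bin ...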

import fasttext

# Suppress warning messages
fasttext.FastText.eprint = lambda x: None

# Load the pre-trained model
model = fasttext.load_model("./lid.176.bin")

txt = """Rosenstockcén fáth a bhfuil tú ag stánadh?éirigh as  ...scanróidh tú an ghéwhy in some cases republish it, but  only in accordance with the terms 微软必应搜索引擎已经全面升级"""

def clean_text(text):
    replacements = {
        "\n": " ", "“": " ", "”": " ", "？": " ", "‘": " ",
        "’": " ", "！": " ", "。": " ", "…": " ", ". . .": "",
        "...": "...", "–": "-"
    }
    for old, new_ in replacements.items():
        text = text.replace(old, new_)
    return text

txt1 = clean_text(txt)

language_detected1 = model.predict(txt1, k=1)

if language_detected1[0][0] == '__label__en':
    print('--------------------------------------------- en')
elif language_detected1[0][0] == '__label__zh':
    print('********************************************* cn')
else:
    print('unknown, maybe Chinese punctuation was used in English text')

PS C:\Users\dengz\Downloads\pdf\test4\2> & C:/Python310/python.exe c:/Users/dengz/Downloads/pdf/test4/2/test11.py
--------------------------------------------- en

So strange! I have no idea where the problem is.
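
A possible reading of this result: `model.predict` returns the top label for the whole string, so a string that mixes several languages is reported under whichever single label scores highest. The sketch below is one way to look at the pieces separately. It splits the text into runs of CJK characters and runs of everything else, then labels each run on its own and marks low-probability results as uncertain. The helper names `split_by_script` and `detect_runs`, the `min_conf` threshold, and the `\u4e00-\u9fff` range are assumptions made for illustration and do not come from the thread. Splitting by script cannot separate two Latin-script languages such as the Irish and English fragments in the sample.

```python
import re

# Assumed split rule: runs of CJK ideographs vs. runs of anything else.
_RUNS = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")


def split_by_script(text):
    """Split text into alternating CJK / non-CJK runs, dropping blank runs."""
    return [run.strip() for run in _RUNS.findall(text) if run.strip()]


def detect_runs(text, predict, min_conf=0.5):
    """Label each run separately.

    `predict` is any callable shaped like fasttext's model.predict:
    predict(text, k=1) -> (labels, probabilities).
    """
    results = []
    # model.predict handles one line at a time, so newlines are removed first.
    for run in split_by_script(text.replace("\n", " ")):
        labels, probs = predict(run, k=1)
        prob = float(probs[0])
        label = labels[0] if prob >= min_conf else "uncertain"
        results.append((run, label, prob))
    return results
```

With the model loaded as in the post above, a call would look like `detect_runs(txt, model.predict)`; each tuple then shows the run, its label, and the probability the model assigned.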

陶远航 · posted on 2023-5-13 11:06:24

blackantt posted on 2023-5-13 11:04
PS C:\Users\dengz\Downloads\pdf\test4\2> & C:/Python310/python.exe c:/Users/dengz/Downloads/pd ...

Then take a look at the second solution.

blackantt · posted on 2023-5-13 11:10:07

陶远航 posted on 2023-5-13 11:06
Then take a look at the second solution.

Traceback (most recent call last):
  File "c:\Users\dengz\Downloads\pdf\test4\2\test11.py", line 8, in <module>
    model = fasttext.load_model('lid.176.bin', encoding='utf-8')
TypeError: load_model() got an unexpected keyword argument 'encoding'
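
The traceback shows that `load_model` in this installation does not accept an `encoding` keyword, so the call has to pass the model path only. Below is a minimal sketch of a loader that does this and checks the path first. `load_lid_model` is a hypothetical helper name, and the default path is taken from the earlier posts.

```python
from pathlib import Path


def load_lid_model(path="./lid.176.bin"):
    """Load the language-ID model, failing early if the file is missing."""
    model_path = Path(path)
    if not model_path.is_file():
        raise FileNotFoundError(f"language-ID model not found: {model_path}")
    # Imported here so the path check runs even where fasttext is not installed.
    import fasttext
    # Only the path is passed; no encoding argument.
    return fasttext.load_model(str(model_path))
```

A missing file then produces a `FileNotFoundError` naming the path.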

陶远航 · posted on 2023-5-13 11:25:04

blackantt posted on 2023-5-13 11:10
Traceback (most recent call last):
File "c:\Users\dengz\Downloads\pdf\test4\2\test11.py", line 8, ...

That's really strange. You could ask @isdkz to help you.

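blackantt · posted on 2023-5-13 12:00:43

陶远航 posted on 2023-5-13 11:25
That's really strange. You could ask @isdkz to help you.

Detection probably goes wrong on mixed-language text. I'll just make do with checking for non-printable characters to decide whether the text is pure English. Thanks.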
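
The workaround is not shown in the thread. One way it might look, as a sketch only: count the characters that fall outside Python's `string.printable` (ASCII letters, digits, punctuation, and whitespace) and treat the text as pure English when none, or few enough, are found. The names `non_printable_ratio` and `looks_pure_english` and the `max_ratio` parameter are hypothetical. This is a character check, not language detection, so any ASCII-only text passes, and curly quotes or en dashes fail unless they are normalised first, as `clean_text` above does.

```python
import string

# ASCII letters, digits, punctuation and whitespace.
_ALLOWED = set(string.printable)


def non_printable_ratio(text):
    """Fraction of characters that are outside ASCII printable."""
    if not text:
        return 0.0
    outside = sum(1 for ch in text if ch not in _ALLOWED)
    return outside / len(text)


def looks_pure_english(text, max_ratio=0.0):
    """True when at most max_ratio of the characters are non-ASCII."""
    return non_printable_ratio(text) <= max_ratio
```

Raising `max_ratio` slightly would tolerate an occasional stray symbol at the cost of letting short foreign fragments through.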

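陶远航 · posted on 2023-5-13 12:02:08

blackantt posted on 2023-5-13 12:00
Detection probably goes wrong on mixed-language text. I'll just make do with checking for non-printable characters to decide whether the text is pure English. Thanks.

Since you've solved it, could you give me the best answer?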

歌者文明清理员 · posted on 2023-5-13 12:31:31

陶远航 posted on 2023-5-13 12:02
Since you've solved it, could you give me the best answer?

No.

sfqxx_小 · posted on 2023-5-13 12:51:00

歌者文明清理员 posted on 2023-5-13 12:31
No.

Agreed. (Please rate my post, I'm about to level up.)

歌者文明清理员 · posted on 2023-5-13 12:54:10

sfqxx_小 posted on 2023-5-13 12:51
Agreed. (Please rate my post, I'm about to level up.)

Oops, I gave 1 contribution point too many.

sfqxx_小 · posted on 2023-5-13 12:54:32

歌者文明清理员 posted on 2023-5-13 12:54
Oops, I gave 1 contribution point too many.
