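How do I represent a collection of English documents as bag-of-words feature vectors? How can this be implemented in Python?
Step 1.
1. Replace every kind of punctuation mark with a space, then turn the text into a list of word strings with split().
2. Loop over the words: if a word is already in the dictionary, add 1 to its count; if it is not, add it to the dictionary.
Step 2. Build the dictionary (vocabulary) for the whole document collection.
Step 3. Build each document's word-frequency list, following the word order of the vocabulary.
1. Take the keys of each document's dictionary and convert them to a set.
2. Use Python's built-in set() to merge all the sets (their union), which gives every word that appears anywhere in the collection, then convert that set to a list.
3. For each document, create a word-frequency vector as long as the vocabulary, initialized to all zeros.
Traverse the document's own dictionary and look up each word's position index in the overall vocabulary.
Assign the word's frequency to the corresponding position in the document's list.
For reference only: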
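s1 = '''When in the Course of human events, it becomes necessary for one people to dissolve the political bands which have connected them with another, and to assume among the Powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires that they should declare the causes which impel them to the separation. '''
# Lowercase so that "The" and "the" count as the same word
s2 = s1.lower()
# Replace punctuation with spaces so it does not stay attached to the words
s2 = s2.replace(',', ' ')
s2 = s2.replace('.', ' ')
# Drop the possessive 's ("nature's" becomes "nature")
s2 = s2.replace('\'s', '')
# split() with no argument splits on any run of whitespace
ls1 = s2.split()
set1 = set(ls1)
dic1 = {}
for x in set1:
    # Count whole words in the list; s2.count(x) would also match substrings
    dic1[x] = ls1.count(x)
# Sort the (word, count) pairs by count, highest first
result = sorted(dic1.items(), key=lambda item: item[1], reverse=True)
print(result)
suchocolate posted on 2020-10-4 08:25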
For reference only
Thanks a lot, brother, this gave me a lot of ideas.
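The three steps from the question (count words per document, merge the keys into one vocabulary, fill a zero vector per document) can be sketched for several documents at once roughly as follows. This is only a minimal sketch: the function names tokenize and bag_of_words and the two sample sentences are made up for illustration, every non-letter character (including apostrophes) is treated as punctuation, and the vocabulary is sorted alphabetically just so the column order is reproducible.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, replace every run of non-letters with a space, split into words."""
    # Assumption: apostrophes count as punctuation, so "nature's" -> "nature", "s"
    return re.sub(r"[^a-z]+", " ", text.lower()).split()


def bag_of_words(docs):
    """Return (vocabulary, vectors) for a list of document strings."""
    # Step 1: one {word: count} dictionary per document
    counts = [Counter(tokenize(doc)) for doc in docs]
    # Step 2: union of the key sets of all documents, turned into a sorted list
    vocab = sorted(set().union(*counts))
    index = {word: i for i, word in enumerate(vocab)}
    # Step 3: one all-zero vector per document, filled in from its dictionary
    vectors = []
    for count in counts:
        vec = [0] * len(vocab)
        for word, freq in count.items():
            vec[index[word]] = freq
        vectors.append(vec)
    return vocab, vectors


# Hypothetical sample documents, for illustration only
docs = ["The cat sat on the mat.", "The dog sat on the log, the dog!"]
vocab, vectors = bag_of_words(docs)
print(vocab)
for vec in vectors:
    print(vec)
```

Looking up positions through the index dictionary instead of calling vocab.index(word) for every word avoids rescanning the vocabulary list each time, which starts to matter once the vocabulary gets large.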