import codecs as cs
import jieba as jb
import pandas as pd

# Read the raw lines of 3k.txt and segment each line with jieba's search-mode cut
f = cs.open("3k.txt", "rb")
lines = f.readlines()
f.close()

data = []
dic = {}
for each in lines:
    bad = jb.cut_for_search(each)
    data.append(bad)

# Count how often each token appears
for eachline in data:
    for eachword in eachline:
        if eachword in dic:
            dic[eachword] += 1
        else:
            dic[eachword] = 1

# Sort by frequency, highest first; sorteddic is a list of (word, count) tuples
sorteddic = sorted(dic.items(), key=lambda x: x[1], reverse=True)
This part of the code works fine (it was written by one of the experts on this site).
However, the resulting word frequencies contain a lot of useless modal particles and punctuation marks.
I wanted to use a stopword list to remove that junk, but I get an encoding error and don't know how to deal with it:
# Load the stopword list and try to drop any entries that appear in it
words_df = pd.DataFrame({"sorteddic": sorteddic})
stopwords = pd.read_csv("stopwords.txt", index_col=False, quoting=3, sep="\t", names=["stopword"], encoding="utf-8")
words_df = words_df[~words_df.sorteddic.isin(stopwords.stopword)]
print(words_df)
The error is: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
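For what it's worth, a 0xff byte at position 0 cannot start a valid UTF-8 sequence, so stopwords.txt is most likely not UTF-8 at all (0xFF 0xFE is the UTF-16 little-endian byte-order mark; a locale encoding such as GBK is another possibility). Below is a minimal sketch of one way around it, not a definitive fix: read_stopwords is a hypothetical helper that simply retries a few likely encodings, and since sorteddic holds (word, count) tuples rather than bare words, the sketch also splits them into separate columns so the comparison against the stopword strings can actually match.

import pandas as pd

def read_stopwords(path):
    # Assumption: the file is UTF-8 (possibly with BOM), UTF-16, or GBK.
    # Try each in turn until one decodes cleanly.
    for enc in ("utf-8-sig", "utf-16", "gbk"):
        try:
            return pd.read_csv(path, index_col=False, quoting=3, sep="\t",
                               names=["stopword"], encoding=enc)
        except UnicodeError:
            continue
    raise ValueError("could not decode %s with the tried encodings" % path)

stopwords = read_stopwords("stopwords.txt")

# Put word and count in separate columns and filter on the word column,
# because isin() compares element-wise and a (word, count) tuple will never
# equal a plain stopword string.
words_df = pd.DataFrame(sorteddic, columns=["word", "count"])
words_df = words_df[~words_df["word"].isin(stopwords["stopword"])]
print(words_df)

If the stopword file really is UTF-16, simply passing encoding="utf-16" to the original read_csv call should already make the UnicodeDecodeError go away.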