A Brief Introduction to Implementing a Word Frequency Analyzer in Python

Published: 2022-11-20

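Contents:

1. Counting word frequency with Python
2. Word frequency statistics for one column of a CSV file in Python
3. How do I segment text with Python and jieba and count word frequency?
4. Python question: when I do Chinese word frequency analysis in Python, I keep getting UnicodeDecodeError: 'utf-8'. What is wrong?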

Counting word frequency with Python

def statistics(astr):
    # Strip the trailing newline first, then split the line on tabs.
    slist = astr.rstrip("\n").split("\t")
    alist = []
    # Keep each token once per line, preserving first-seen order.
    for i in slist:
        if i not in alist:
            alist.append(i)
    return alist


if __name__ == "__main__":
    code_doc = {}
    with open("test_data.txt", "r", encoding='utf-8') as fs:
        for ln in fs:
            for t in statistics(ln):
                # Tokens are de-duplicated per line, so this counts the
                # number of lines in which each token appears.
                code_doc[t] = code_doc.get(t, 0) + 1
    # Print the totals once, after the whole file has been read.
    for key, count in code_doc.items():
        print(key + ' ' + str(count))
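
The script above removes duplicate tokens within each line, so a token that occurs twice on one line is counted only once. If every occurrence should count, a minimal sketch with `collections.Counter` might look like the following. The function name `count_tokens` and the sample lines are assumptions made for illustration; they are not part of the original script.

```python
from collections import Counter


def count_tokens(lines):
    """Count every occurrence of each tab-separated token across all lines."""
    counts = Counter()
    for line in lines:
        # Drop the line ending, then split on tabs; skip empty fields.
        tokens = [t for t in line.rstrip("\r\n").split("\t") if t]
        counts.update(tokens)
    return counts


# Hypothetical sample input standing in for the lines of test_data.txt.
sample_lines = ["apple\tbanana\tapple\n", "banana\tcherry\n"]
token_counts = count_tokens(sample_lines)
for token, count in token_counts.most_common():
    print(token, count)
```

To run it on a real file, pass the open file object instead of the sample list, for example `count_tokens(open("test_data.txt", encoding="utf-8"))`.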

Word frequency statistics for one column of a CSV file in Python

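import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# To avoid path problems, use the full path to the file.
data = pd.read_csv('XXX.csv')
# m:n selects the column (or columns) to analyze.
# The values below are placeholders; adjust them to your file.
m, n = 0, 1
trainheadlines = []
for row in range(0, len(data.index)):
    trainheadlines.append(' '.join(str(x) for x in data.iloc[row, m:n]))
# TfidfVectorizer yields TF-IDF weights rather than raw counts;
# scikit-learn's CountVectorizer takes the same arguments and yields counts.
# min_df=1 and max_df=1.0 keep every term. An integer max_df=1 would
# discard any term that appears in more than one row.
advancedvectorizer = TfidfVectorizer(min_df=1, max_df=1.0, max_features=20000, ngram_range=(1, 1))
advancedtrain = advancedvectorizer.fit_transform(trainheadlines)
# Shape is (number of rows, number of distinct terms).
print(advancedtrain.shape)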
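
The script above prints only the shape of a TF-IDF matrix, not a count per word. For plain per-word counts of a single column, a minimal sketch using only the standard library could look like this. The function name `column_word_counts`, the column name `headline`, and the in-memory sample data are all assumptions for illustration. The sketch splits on whitespace, so Chinese text would first need a segmenter such as jieba.

```python
import csv
import io
from collections import Counter


def column_word_counts(csv_file, column):
    """Count whitespace-separated words in one named column of a CSV file."""
    counts = Counter()
    reader = csv.DictReader(csv_file)
    for row in reader:
        # Missing cells come back as None or ''; treat both as empty text.
        text = row.get(column) or ""
        counts.update(text.lower().split())
    return counts


# Hypothetical in-memory CSV; with a real file, pass
# open(path, newline="", encoding="utf-8") instead.
sample_csv = io.StringIO(
    "id,headline\n"
    "1,Python counts words\n"
    "2,Python reads CSV files\n"
)
headline_counts = column_word_counts(sample_csv, "headline")
print(headline_counts.most_common(3))
```

Lower-casing before counting merges "Python" and "python" into one entry; drop the `.lower()` call if case should be preserved.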

How do I segment text with Python and jieba and count word frequency?

#! python3
# -*- coding: utf-8 -*-
import jieba
from collections import Counter


def get_words(txt):
    # Segment the text into words with jieba.
    seg_list = jieba.cut(txt)
    c = Counter()
    for x in seg_list:
        # Skip single-character tokens and bare line breaks.
        if len(x) > 1 and x != '\r\n':
            c[x] += 1
    print('Most frequent words')
    for (k, v) in c.most_common(100):
        # Padding, the word, one '*' per three occurrences, then the count.
        print('%s%s %s   %d' % (('   '*(5-len(k))), k, '*'*int(v/3), v))


if __name__ == '__main__':
    with open('19d.txt', 'r', encoding='utf-8') as f:
        txt = f.read()
    get_words(txt)

Python question: when I do Chinese word frequency analysis in Python, I keep getting UnicodeDecodeError: 'utf-8'. What is wrong?

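Cause: the file is not UTF-8 encoded, but it is being decoded as UTF-8 by default. The fix is either to decode the file with the encoding it was actually saved in, or to convert the file to UTF-8. To convert it: open the file and choose File > Save As. The dialog shows that the file's current encoding is ANSI (on Simplified Chinese Windows this is normally GBK); change the encoding to UTF-8 and save.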
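
The other route, decoding with the file's real encoding, can be sketched in code. The helper below tries UTF-8 first and falls back to GBK. The function name `read_text`, the candidate list, and the demo file are assumptions for illustration; the right fallback depends on how the file was actually saved.

```python
import os
import tempfile


def read_text(path, encodings=("utf-8", "gbk")):
    """Return the file's text, trying each candidate encoding in order."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            # This encoding does not fit the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode " + path)


# Demo: write a GBK-encoded file, then read it back without naming its encoding.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_gbk.txt")
    with open(demo_path, "w", encoding="gbk") as f:
        f.write("中文词频分析")
    demo_text = read_text(demo_path)
print(demo_text)
```

UTF-8 is tried first because it is strict and rejects most text saved in other encodings, whereas GBK may accept bytes that were never GBK and return garbled characters. If the encoding is known, passing it directly to `open()` is simpler.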