I. Text Cleaning and Preprocessing
Text cleaning and preprocessing is an essential step in natural language processing: raw text is full of noise, special symbols, and other artifacts that would otherwise introduce interference and errors into downstream processing. Here are some common cleaning and preprocessing techniques:
1. Removing non-text content, such as HTML tags
import re

def remove_html_tags(text):
    """Remove HTML tags with a non-greedy regex."""
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)

text = '<h1>This is a headline.</h1><p>This is a paragraph.</p>'
print(remove_html_tags(text))
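A regex like <.*?> can misbehave on malformed markup or on attribute values that contain >. As a sketch of a sturdier alternative, Python's standard-library html.parser can collect only the text content (the TagStripper class and strip_tags helper here are illustrative names, not part of any library):

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Accumulate only the text content of an HTML fragment."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)

def strip_tags(html):
    parser = TagStripper()
    parser.feed(html)
    return ''.join(parser.parts)

print(strip_tags('<h1>This is a headline.</h1><p>This is a paragraph.</p>'))
```

For serious HTML cleaning, a dedicated parser such as BeautifulSoup is usually the better choice.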
2. Removing special characters, such as punctuation and digits
import string

def remove_punctuation(text):
    """Remove punctuation characters."""
    return text.translate(str.maketrans('', '', string.punctuation))

text = "Let's try to remove punctuation from this text!"
print(remove_punctuation(text))
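The heading also mentions digits; a small sketch in the same style (the remove_digits name is my own) that strips digit runs and then collapses the leftover whitespace:

```python
import re

def remove_digits(text):
    """Delete digit runs, then collapse extra whitespace."""
    no_digits = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', no_digits).strip()

print(remove_digits('Chapter 12 was revised 3 times in 2023.'))
```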
3. Word tokenization
import nltk

# The first run may require: nltk.download('punkt')
text = "This is a sentence for word tokenization."
tokens = nltk.word_tokenize(text)
print(tokens)
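When the NLTK dependency (and its punkt model download) is undesirable, a rough regex-based stand-in can cover simple cases. This simple_tokenize helper is hypothetical and handles far fewer edge cases than nltk.word_tokenize:

```python
import re

def simple_tokenize(text):
    """Split text into word tokens, keeping internal apostrophes."""
    return re.findall(r"\w+(?:'\w+)?", text)

print(simple_tokenize("This is a sentence for word tokenization."))
```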
II. Text Feature Extraction
Text feature extraction is an important concept in natural language processing. To build an NLP model, we need to convert text into a meaningful feature representation. Here are some common feature extraction techniques:
1. Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())
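To make the representation concrete: CountVectorizer essentially builds a sorted vocabulary and counts token occurrences per document. A hand-rolled sketch of the same idea (lowercased \w+ tokens, which only approximates CountVectorizer's default tokenization):

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This is the second second document.',
]

def tokenize(doc):
    # Lowercase and keep runs of word characters
    return re.findall(r'\w+', doc.lower())

# Sorted vocabulary over the whole corpus
vocab = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One count row per document, one column per vocabulary word
rows = [[Counter(tokenize(doc))[word] for word in vocab] for doc in corpus]
print(vocab)
print(rows)
```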
2. TF-IDF model
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(X.toarray())
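The intuition behind the weights: TF-IDF multiplies a term's frequency in a document by its inverse document frequency, so words that appear everywhere contribute little. TfidfVectorizer's default smoothed form is idf(t) = ln((1 + n) / (1 + df(t))) + 1, which a few lines can illustrate:

```python
import math

def idf(n_docs, df):
    """Smoothed inverse document frequency, as in TfidfVectorizer's default."""
    return math.log((1 + n_docs) / (1 + df)) + 1

# In the 4-document corpus above, 'the' appears in every document,
# while 'first' appears in only 2 of them.
print(idf(4, 4))  # ubiquitous term -> minimum weight of 1.0
print(idf(4, 2))  # rarer term -> higher weight
```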
III. Text Classification
Text classification is an important application of natural language processing. To classify text, we build a classifier that automatically assigns documents to predefined categories. Here are some common text classification techniques:
1. Naive Bayes classifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

clf = MultinomialNB()
clf.fit(X, [1, 1, 2, 2])  # toy labels: class 1 vs class 2

test_text = "Is this the third document?"
test_vec = vectorizer.transform([test_text])
print(clf.predict(test_vec))
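To see what MultinomialNB is computing, here is a toy hand-rolled multinomial naive Bayes with Laplace (add-one) smoothing over whitespace tokens. This is a sketch for intuition only, not a replacement for scikit-learn:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Return a predict() closure for a smoothed multinomial naive Bayes."""
    classes = sorted(set(labels))
    vocab = {w for d in docs for w in d.lower().split()}
    prior = {c: labels.count(c) / len(labels) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.lower().split())
    totals = {c: sum(counts[c].values()) for c in classes}

    def predict(doc):
        # Score each class: log P(c) + sum of log P(w | c) with add-one smoothing
        scores = {
            c: math.log(prior[c]) + sum(
                math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
                for w in doc.lower().split() if w in vocab
            )
            for c in classes
        }
        return max(scores, key=scores.get)

    return predict

docs = ['good great film', 'great acting good', 'bad boring plot', 'boring bad film']
predict = train_nb(docs, ['pos', 'pos', 'neg', 'neg'])
print(predict('good film'))    # -> 'pos'
print(predict('boring plot'))  # -> 'neg'
```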
2. Support vector machine classifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

clf = LinearSVC()
clf.fit(X, [1, 1, 2, 2])

test_text = "Is this the third document?"
test_vec = vectorizer.transform([test_text])
print(clf.predict(test_vec))
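In practice the vectorizer and classifier are usually bundled into a single estimator, so the vocabulary fitted during training is automatically reused at prediction time. A sketch using scikit-learn's Pipeline with the same toy corpus and labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
labels = [1, 1, 2, 2]

# fit() vectorizes and trains in one step; predict() reuses the fitted vocabulary
model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('svm', LinearSVC()),
])
model.fit(corpus, labels)
print(model.predict(['Is this the third document?']))
```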