自然语言处理：网购商品评论情感判定

关注

发布于: 2021 年 04 月 07 日

自然语言处理（Natural Language Processing，简称 NLP），是为各类企业及开发者提供的用于文本分析及挖掘的核心工具，旨在帮助用户高效的处理文本，已经广泛应用在电商、文娱、司法、公安、金融、医疗、电力等行业客户的多项业务中，取得了良好的效果。

1、项目背景

任何行业领域，用户对产品的评价都显得尤为重要。通过用户评论，可以对用户情感倾向进行判定。

例如，目前最为普遍的网购行为：对于用户来说，参考评论可以做出更优的购买决策；对于商家来说，对商品评论按照情感倾向进行分类，并通过文本聚类得到普遍提及的商品优缺点，可以进一步改良产品。

本案例主要讨论如何对商品评论进行情感倾向判定。下图为某电商平台上针对某款手机的部分评论：

2、数据集

这份某款手机的商品评论信息数据集，包含 2 个属性，共计 8187 个样本。

使用Pandas中的read_excel函数读取xls格式的数据集文件，注意文件的编码设置为gb18030，代码如下所示：

import pandas as pd
#读入数据集data = pd.read_excel("data.xls", encoding='gb18030')print(data.head())

复制代码

读取数据集效果（部分）如下所示：

查看数据集的相关信息，包括行列数，列名，以及各个类别的样本数，实现代码如下所示：

# 数据集的大小print(data.shape)
# 数据集的列名print(data.columns.values)
# 不同类别数据记录的统计print(data['Class'].value_counts())

复制代码

效果如下所示

(8186, 2)
array([u'Comment', u'Class'], dtype=object)
 1    3042-1    2657 0    2487Name: Class, dtype: int64

复制代码

3、数据预处理

现在，我们要将Comment列的文本信息，转化成数值矩阵表示，也就是将文本映射到特征空间。

首先，通过jieba，使用HMM模型，对文本进行中文分词，实现代码如下所示：

# 导入中文分词库jiebaimport jiebaimport numpy as np

复制代码

接下来，对数据集的每个样本的文本进行中文分词，如遇到缺失值，使用“还行、一般吧”进行填充，实现代码如下所示：

cutted = []for row in data.values:    try:        raw_words = (" ".join(jieba.cut(row[0])))        cutted.append(raw_words)    except AttributeError:        print row[0]        cutted.append(u"还行 一般吧")
cutted_array = np.array(cutted)
# 生成新数据文件，Comment字段为分词后的内容data_cutted = pd.DataFrame({    'Comment': cutted_array,    'Class': data['Class']})

复制代码

读取并查看预处理后的数据，实现代码如下所示：

print(data_cutted.head())

复制代码

数据集效果（部分）如下所示：

为了更直观地观察词频高的词语，我们使用第三方库wordcloud进行文本的可视化，导入库实现代码如下所示：

# 导入第三方库wordcloud
from wordcloud import WordCloudimport matplotlib.pyplot as plt

复制代码

针对好评，中评和差评的文本，建立WordCloud对象，绘制词云，好评词云可视化实现代码如下所示：

# 好评wc = WordCloud(font_path='Courier.ttf')wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]))plt.axis('off')plt.imshow(wc)plt.show()

复制代码

好评词云效果如下所示：

中评词云可视化实现代码如下所示：

# 中评
wc = WordCloud(font_path='Courier.ttf')wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]))plt.axis('off')plt.imshow(wc)plt.show()

复制代码

中评词云效果如下所示：

差评词云可视化实现代码如下所示：

# 差评
wc = WordCloud(font_path='Courier.ttf')wc.generate(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]))plt.axis('off')plt.imshow(wc)plt.show()

复制代码

差评词云效果如下所示：

从词云展现的词频统计图来看，"手机"，"就是"，"屏幕"，"收到"等词对于区分毫无帮助而且会造成偏差。因此，需要把这些对区分类没有意义的词语筛选出来，放到停用词文件stopwords.txt中。实现代码如下所示：

# 读入停用词文件import codecs
with codecs.open('stopwords.txt', 'r', encoding='utf-8') as f:    stopwords = [item.strip() for item in f]    for item in stopwords[0:200]:    print(item,)

复制代码

输出停用词效果如下所示：

使用jieba库的extract_tags函数，统计好评，中评，差评文本中的TOP20关键词。

#设定停用词文件,在统计关键词的时候，过滤停用词import jieba.analyse
jieba.analyse.set_stop_words('stopwords.txt')

复制代码

好评关键词分析，实现代码如下所示：

# 好评关键词keywords_pos = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == 1]), topK=20)for item in keywords_pos:    print（item,）

复制代码

好评关键词TOP20如下所示：

不错 正品 赠品 五分 发货 东西 满意 机子 喜欢 收到 很漂亮 充电 好评 很快 卖家 速度 评价 流畅 快递 物流

复制代码

中评关键词分析，实现代码如下所示：

#中评关键词keywords_med = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == 0]), topK=20)for item in keywords_med:    print(item,)

复制代码

中评关键词TOP20如下所示：

充电 不错 发热 外观 感觉 电池 机子 问题 赠品 有点 无线 发烫 换货 软件 快递 安卓 内存 退货 知道 售后

复制代码

差评关键词分析，实现代码如下所示：

#差评关键词keywords_neg = jieba.analyse.extract_tags(''.join(data_cutted['Comment'][data_cutted['Class'] == -1]), topK=20)
for item in keywords_neg:    print（item,）

复制代码

差评关键词TOP20如下所示：

差评 售后 垃圾 赠品 退货 问题 换货 充电 降价 发票 充电器 东西 刚买 发热 无线 机子 死机 收到 质量 15

复制代码

经过以上步骤的处理，整个数据集的预处理工作“告一段落”。在中文文本分析和情感分析的工作中，数据预处理的内容主要是分词。只有经过分词处理后的文本数据集才可以进行下一步的向量化操作，满足输入模型的条件。

4、基于 SVM 的情感分类模型

经过分词之后的文本数据集要先进行向量化之后才能输入到分类模型中进行运算。

我们使用sklearn库实现向量化方法，去掉停用词，并将其通过tf，tf-idf映射到特征空间。

其中，tftf 为词频，即分词后每个词项在该条评论中出现的次数；dfdf 为出现该词项评论数目；NN 为评论总数，使用对数来适当抑制 tftf 和 dfdf 值的影响。

我们使用sklearn库中的函数直接实现 SVM 算法。在这里，我们选取以下形式的 SVM 模型参与运算。

为了方便，创建文本情感分析类CommentClassifier，来实现建模过程：

__init__为类的初始化函数，输入参数classifier_type和vector_type,分别代表分类模型的类型和向量化方法的类型。
fit()函数，来实现向量化与模型建立的过程。

实现代码如下所示：

# 实现向量化方法from sklearn.feature_extraction.text import TfidfVectorizerfrom sklearn.feature_extraction.text import CountVectorizer
#实现svm和贝叶斯模型from sklearn.svm import SVCfrom sklearn.svm import LinearSVCfrom sklearn.linear_model import SGDClassifier

# 实现交叉验证from sklearn.cross_validation import train_test_splitfrom sklearn.cross_validation import cross_val_score
# 实现评价指标from sklearn import metrics
# 文本情感分类的类：CommentClassifierclass CommentClassifier:    def __init__(self, classifier_type, vector_type):        self.classifier_type = classifier_type #分类器类型：支持向量机或贝叶斯分类        self.vector_type = vector_type         #文本向量化模型：0\1模型,TF模型,TF-IDF模型
    def fit(self, train_x, train_y, max_df):        list_text = list(train_x)                #向量化方法：0 - 0/1,1 - TF,2 - TF-IDF        if self.vector_type == 0:            self.vectorizer = CountVectorizer(max_df, stop_words = stopwords, ngram_range=(1, 3)).fit(list_text)        elif self.vector_type == 1:            self.vectorizer = TfidfVectorizer(max_df, stop_words = stopwords, ngram_range=(1, 3), use_idf=False).fit(list_text)        else:            self.vectorizer = TfidfVectorizer(max_df, stop_words = stopwords, ngram_range=(1, 3)).fit(list_text)
        self.array_trainx = self.vectorizer.transform(list_text)        self.array_trainy = train_y
        #分类模型选择：1 - SVC,2 - LinearSVC,3 - SGDClassifier，三种SVM模型          if self.classifier_type == 1:            self.model = SVC(kernel='linear', gamma=10 ** -5, C=1).fit(self.array_trainx, self.array_trainy)        elif self.classifier_type == 2:            self.model = LinearSVC().fit(self.array_trainx, self.array_trainy)        else:            self.model = SGDClassifier().fit(self.array_trainx, self.array_trainy)            def predict_value(self, test_x):        list_text = list(test_x)        self.array_testx = self.vectorizer.transform(list_text)        array_predict = self.model.predict(self.array_testx)        return array_predict
    def predict_proba(self, test_x):        list_text = list(test_x)        self.array_testx = self.vectorizer.transform(list_text)        array_score = self.model.predict_proba(self.array_testx)        return array_score

复制代码

使用train_test_split()函数划分训练集和测试集。训练集：80%；测试集：20%。
建立classifier_type和vector_type两个参数的取值列表，来表示选择的向量化方法以及分类模型
输出每种向量化方法和分类模型的组合所对应的分类评价结果，内容包括混淆矩阵以及含Precision、Recall和F1-score三个指标的评分矩阵

实现代码如下所示：

#划分训练集，测试集train_x, test_x, train_y, test_y = train_test_split(data_cutted['Comment'].ravel().astype('U'), data_cutted['Class'].ravel(),                                                        test_size=0.2, random_state=4)
classifier_list = [1,2,3]vector_list = [0,1,2]
for classifier_type in classifier_list:    for vector_type in vector_list:        commentCls = CommentClassifier(classifier_type, vector_type)        #max_df 设置为0.98        commentCls.fit(train_x, train_y, 0.98)        if classifier_type == 0:            value_result = commentCls.predict_value(test_x)            proba_result = commentCls.predict_proba(test_x)            print(classifier_type,vector_type)            print('classification report')            print(metrics.classification_report(test_y, value_result, labels=[-1, 0, 1]))            print('confusion matrix')            print(metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1]))        else:            value_result = commentCls.predict_value(test_x)            print(classifier_type,vector_type)            print('classification report')            print(metrics.classification_report(test_y, value_result, labels=[-1, 0, 1]))            print('confusion matrix')            print(metrics.confusion_matrix(test_y, value_result, labels=[-1, 0, 1]))

复制代码

输出效果如下所示：

1 0classification report             precision    recall  f1-score   support
         -1       0.68      0.62      0.65       519          0       0.55      0.49      0.52       485          1       0.75      0.86      0.80       634
avg / total       0.67      0.68      0.67      1638
confusion matrix[[324 130  65] [131 236 118] [ 24  64 546]]1 1classification report             precision    recall  f1-score   support
         -1       0.71      0.74      0.72       519          0       0.58      0.54      0.56       485          1       0.84      0.85      0.85       634
avg / total       0.72      0.72      0.72      1638
confusion matrix[[385 109  25] [145 263  77] [ 15  80 539]]1 2classification report             precision    recall  f1-score   support
         -1       0.70      0.74      0.72       519          0       0.58      0.52      0.55       485          1       0.84      0.86      0.85       634
avg / total       0.72      0.72      0.72      1638
confusion matrix[[386 106  27] [151 254  80] [ 14  76 544]]2 0classification report             precision    recall  f1-score   support
         -1       0.70      0.62      0.66       519          0       0.56      0.51      0.54       485          1       0.76      0.88      0.82       634
avg / total       0.68      0.69      0.68      1638
confusion matrix[[320 135  64] [122 248 115] [ 16  57 561]]2 1classification report             precision    recall  f1-score   support
         -1       0.69      0.73      0.71       519          0       0.61      0.48      0.54       485          1       0.81      0.91      0.86       634
avg / total       0.71      0.73      0.72      1638
confusion matrix[[377 108  34] [154 233  98] [ 12  44 578]]2 2classification report             precision    recall  f1-score   support
         -1       0.70      0.74      0.72       519          0       0.61      0.50      0.55       485          1       0.83      0.91      0.87       634
avg / total       0.72      0.73      0.73      1638
confusion matrix[[383 108  28] [154 241  90] [ 13  43 578]]3 0classification report             precision    recall  f1-score   support
         -1       0.69      0.69      0.69       519          0       0.58      0.47      0.52       485          1       0.79      0.90      0.84       634
avg / total       0.70      0.71      0.70      1638
confusion matrix[[359 118  42] [148 228 109] [ 14  47 573]]3 1classification report             precision    recall  f1-score   support
         -1       0.70      0.74      0.72       519          0       0.60      0.49      0.54       485          1       0.81      0.88      0.84       634
avg / total       0.71      0.72      0.71      1638
confusion matrix[[386  96  37] [152 240  93] [ 13  66 555]]3 2classification report             precision    recall  f1-score   support
         -1       0.65      0.75      0.69       519          0       0.63      0.49      0.55       485          1       0.83      0.86      0.85       634
avg / total       0.71      0.72      0.71      1638
confusion matrix[[389  98  32] [169 236  80] [ 45  41 548]]

复制代码

从结果上来看，选择tfidf向量化方法，使用LinearSVC模型效果比较好，f1-socre为 0.73

从混淆矩阵来看，我们会发现多数的错误分类都出现在中评和差评上。我们可以将原始数据集的中评删除。实现代码如下所示：

data_bi = data_cutted[data_cutted['Class'] != 0]data_bi['Class'].value_counts()

复制代码

效果如下所示：

 1    3042-1    2658Name: Class, dtype: int64

复制代码

再次运行分类模型，查看分类结果，如下所示：

1 0classification report             precision    recall  f1-score   support
         -1       0.90      0.79      0.84       537          1       0.83      0.92      0.87       603
avg / total       0.86      0.86      0.86      1140
confusion matrix[[425 112] [ 48 555]]1 1classification report             precision    recall  f1-score   support
         -1       0.87      0.92      0.90       537          1       0.93      0.88      0.90       603
avg / total       0.90      0.90      0.90      1140
confusion matrix[[496  41] [ 71 532]]1 2classification report             precision    recall  f1-score   support
         -1       0.88      0.93      0.90       537          1       0.93      0.88      0.91       603
avg / total       0.90      0.90      0.90      1140
confusion matrix[[497  40] [ 70 533]]2 0classification report             precision    recall  f1-score   support
         -1       0.90      0.80      0.85       537          1       0.84      0.92      0.88       603
avg / total       0.87      0.86      0.86      1140
confusion matrix[[431 106] [ 48 555]]2 1classification report             precision    recall  f1-score   support
         -1       0.92      0.91      0.91       537          1       0.92      0.93      0.92       603
avg / total       0.92      0.92      0.92      1140
confusion matrix[[486  51] [ 43 560]]2 2classification report             precision    recall  f1-score   support
         -1       0.93      0.91      0.92       537          1       0.92      0.94      0.93       603
avg / total       0.92      0.92      0.92      1140
confusion matrix[[488  49] [ 39 564]]3 0classification report             precision    recall  f1-score   support
         -1       0.92      0.82      0.87       537          1       0.86      0.94      0.90       603
avg / total       0.89      0.88      0.88      1140
confusion matrix[[443  94] [ 38 565]]3 1classification report             precision    recall  f1-score   support
         -1       0.92      0.91      0.91       537          1       0.92      0.93      0.92       603
avg / total       0.92      0.92      0.92      1140
confusion matrix[[486  51] [ 41 562]]3 2classification report             precision    recall  f1-score   support
         -1       0.88      0.93      0.90       537          1       0.93      0.89      0.91       603
avg / total       0.91      0.91      0.91      1140
confusion matrix[[497  40] [ 67 536]]

复制代码

删除差评之后，不同组合的分类模型效果均有显著提升。这也说明，分类模型能够有效地将好评区分出来。

数据集中存在标注不准确的问题，主要集中在中评。由于人在评论时，除非有问题否则一般都会打好评，如果打了中评说明对产品有不满意之处，在情感的表达上就会趋向于负向情感，同时评论具有很大主观性，很多中评会将其归为差评，但数据集中却认为是中评。因此，将一条评论分类为好评、中评、差评是不够客观，中评与差评之间的边界很模糊，因此识别率很难提高。

5、基于 word2vec 中 doc2vec 的无监督分类模型

开源文本向量化工具 word2vec，可以为文本数据寻求更加深层次的特征表示。词语之间可以进行运算:

w2v(woman)-w2v(man)+w2v(king)=w2v(queen)

基于 word2vec 的 doc2vec，将每个文档表示为一个向量，并且通过余弦距离可以计算两个文档的相似程度，那么就可以计算一句话和一句极好的好评的距离，以及一句话到极差的差评的距离。

在本案例的数据集中：

好评：快就是手感满意也好喜欢也流畅很服务态度实用超快挺快用着速度礼品也不错非常好挺好感觉才来还行好看也快不错的送了非常不错超级赞好多东西很实用各方面挺好的很多漂亮配件还不错也多特意慢满分好用非常漂亮......
差评：不多说上当差差刚用服务差一点也不不要简直还是去实体店大家保证不肯生气开发票磨损后悔印记网什么破烂烂左边失效太骗掉价走下坡路不说了彻底三星手机自营几次真心别的看完简单说机会这是生气了触动缝隙冲动了失望......

我们使用第三方库gensim来实现doc2vec模型。

实现代码如下所示：

import pandas as pdfrom gensim.models import Doc2Vecfrom gensim.models.doc2vec import TaggedDocumentimport logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
train_x = data_bi['Comment'].ravel()train_y = data_bi['Class'].ravel()
#为train_x列贴上标签"TRAIN"def labelizeReviews(reviews, label_type):    labelized = []    for i, v in enumerate(reviews):        label = '%s_%s' % (label_type, i)        labelized.append(TaggedDocument(v.split(" "), [label]))    return labelized

train_x = labelizeReviews(train_x, "TRAIN")
#建立Doc2Vec模型modelsize = 300all_data = []all_data.extend(train_x)
model = Doc2Vec(min_count=1, window=8, size=size, sample=1e-4, negative=5, hs=0, iter=5, workers=8)model.build_vocab(all_data)
# 设置迭代次数10for epoch in range(10):    model.train(train_x)    #建立空列表pos和neg以对相似度计算结果进行存储，计算每个评论和极好评论之间的余弦距离，并存在pos列表中#计算每个评论和极差评论之间的余弦距离，并存在neg列表中pos = []neg = []
for i in range(0,len(train_x)):    pos.append(model.docvecs.similarity("TRAIN_0","TRAIN_{}".format(i)))    neg.append(model.docvecs.similarity("TRAIN_1","TRAIN_{}".format(i)))    #将pos列表和neg列表更新到原始数据文件中，分别表示为字段PosSim和字段NegSimdata_bi[u'PosSim'] = posdata_bi[u'NegSim'] = neg

复制代码

模型训练过程如下所示：

2017-05-27 14:30:28,393 : INFO : collecting all words and their counts2017-05-27 14:30:28,394 : INFO : PROGRESS: at example #0, processed 0 words (0/s), 0 word types, 0 tags2017-05-27 14:30:28,593 : INFO : collected 10545 word types and 5700 unique tags from a corpus of 5700 examples and 482148 words2017-05-27 14:30:28,595 : INFO : Loading a fresh vocabulary2017-05-27 14:30:28,649 : INFO : min_count=1 retains 10545 unique words (100% of original 10545, drops 0)2017-05-27 14:30:28,650 : INFO : min_count=1 leaves 482148 word corpus (100% of original 482148, drops 0)2017-05-27 14:30:28,705 : INFO : deleting the raw counts dictionary of 10545 items2017-05-27 14:30:28,706 : INFO : sample=0.0001 downsamples 217 most-common words2017-05-27 14:30:28,707 : INFO : downsampling leaves estimated 108356 word corpus (22.5% of prior 482148)2017-05-27 14:30:28,709 : INFO : estimated required memory for 10545 words and 300 dimensions: 38560500 bytes2017-05-27 14:30:28,784 : INFO : resetting layer weights2017-05-27 14:30:29,120 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:29,121 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:30,176 : INFO : PROGRESS: at 10.24% examples, 72316 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:31,211 : INFO : PROGRESS: at 29.96% examples, 91057 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:32,218 : INFO : PROGRESS: at 66.30% examples, 126742 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:33,231 : INFO : PROGRESS: at 86.00% examples, 122698 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:33,571 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:33,573 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:33,605 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:33,647 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:33,678 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:33,696 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:33,711 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:33,722 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:33,724 : INFO : training on 2410740 raw words (570332 effective words) took 4.6s, 124032 effective words/s2017-05-27 14:30:33,727 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:33,731 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:34,753 : INFO : PROGRESS: at 36.38% examples, 212225 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:35,762 : INFO : PROGRESS: at 75.24% examples, 216859 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:36,243 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:36,244 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:36,264 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:36,306 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:36,311 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:36,320 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:36,330 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:36,336 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:36,338 : INFO : training on 2410740 raw words (570008 effective words) took 2.6s, 219523 effective words/s2017-05-27 14:30:36,339 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:36,341 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:37,353 : INFO : PROGRESS: at 28.23% examples, 177496 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:38,372 : INFO : PROGRESS: at 66.30% examples, 193880 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:39,061 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:39,062 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:39,074 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:39,115 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:39,122 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:39,132 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:39,147 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:39,154 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:39,155 : INFO : training on 2410740 raw words (570746 effective words) took 2.8s, 203312 effective words/s2017-05-27 14:30:39,158 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:39,159 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:40,168 : INFO : PROGRESS: at 37.74% examples, 222816 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:41,177 : INFO : PROGRESS: at 77.55% examples, 223202 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:41,605 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:41,610 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:41,614 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:41,645 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:41,670 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:41,674 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:41,682 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:41,690 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:41,692 : INFO : training on 2410740 raw words (569889 effective words) took 2.5s, 225457 effective words/s2017-05-27 14:30:41,694 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:41,696 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:42,712 : INFO : PROGRESS: at 29.16% examples, 183182 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:43,754 : INFO : PROGRESS: at 69.96% examples, 203560 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:44,804 : INFO : PROGRESS: at 91.97% examples, 173787 words/s, in_qsize 14, out_qsize 02017-05-27 14:30:44,973 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:44,989 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:45,028 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:45,061 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:45,097 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:45,101 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:45,121 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:45,125 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:45,128 : INFO : training on 2410740 raw words (569903 effective words) took 3.4s, 166370 effective words/s2017-05-27 14:30:45,131 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:45,132 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:46,152 : INFO : PROGRESS: at 11.26% examples, 79348 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:47,153 : INFO : PROGRESS: at 27.52% examples, 85992 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:48,166 : INFO : PROGRESS: at 66.47% examples, 130273 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:49,061 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:49,076 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:49,088 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:49,123 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:49,144 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:49,147 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:49,152 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:49,159 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:49,160 : INFO : training on 2410740 raw words (570333 effective words) took 4.0s, 141860 effective words/s2017-05-27 14:30:49,161 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:49,163 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:50,185 : INFO : PROGRESS: at 31.78% examples, 193530 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:51,244 : INFO : PROGRESS: at 48.51% examples, 141817 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:52,278 : INFO : PROGRESS: at 69.96% examples, 134399 words/s, in_qsize 16, out_qsize 02017-05-27 14:30:52,918 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:52,936 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:52,945 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:52,976 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:52,979 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:52,984 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:52,995 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:52,998 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:52,999 : INFO : training on 2410740 raw words (570031 effective words) took 3.8s, 148864 effective words/s2017-05-27 14:30:53,000 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:53,002 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:54,024 : INFO : PROGRESS: at 34.48% examples, 202424 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:55,035 : INFO : PROGRESS: at 68.58% examples, 201499 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:56,010 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:56,017 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:56,048 : INFO : PROGRESS: at 96.89% examples, 183861 words/s, in_qsize 5, out_qsize 12017-05-27 14:30:56,049 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:56,071 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:56,084 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:56,099 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:56,101 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:56,104 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:56,104 : INFO : training on 2410740 raw words (570328 effective words) took 3.1s, 184129 effective words/s2017-05-27 14:30:56,105 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:56,107 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:57,134 : INFO : PROGRESS: at 33.13% examples, 197730 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:58,140 : INFO : PROGRESS: at 69.96% examples, 206423 words/s, in_qsize 15, out_qsize 02017-05-27 14:30:58,876 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:30:58,883 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:30:58,889 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:30:58,937 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:30:58,949 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:30:58,953 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:30:58,960 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:30:58,967 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:30:58,968 : INFO : training on 2410740 raw words (570312 effective words) took 2.9s, 199922 effective words/s2017-05-27 14:30:58,969 : INFO : training model with 8 workers on 10545 vocabulary and 300 features, using sg=0 hs=0 sample=0.0001 negative=5 window=82017-05-27 14:30:58,970 : INFO : expecting 5700 sentences, matching count from corpus used for vocabulary survey2017-05-27 14:30:59,991 : INFO : PROGRESS: at 32.86% examples, 198045 words/s, in_qsize 16, out_qsize 02017-05-27 14:31:00,993 : INFO : PROGRESS: at 68.23% examples, 201443 words/s, in_qsize 16, out_qsize 02017-05-27 14:31:01,881 : INFO : worker thread finished; awaiting finish of 7 more threads2017-05-27 14:31:01,888 : INFO : worker thread finished; awaiting finish of 6 more threads2017-05-27 14:31:01,907 : INFO : worker thread finished; awaiting finish of 5 more threads2017-05-27 14:31:01,922 : INFO : worker thread finished; awaiting finish of 4 more threads2017-05-27 14:31:01,941 : INFO : worker thread finished; awaiting finish of 3 more threads2017-05-27 14:31:01,948 : INFO : worker thread finished; awaiting finish of 2 more threads2017-05-27 14:31:01,955 : INFO : worker thread finished; awaiting finish of 1 more threads2017-05-27 14:31:01,961 : INFO : worker thread finished; awaiting finish of 0 more threads2017-05-27 14:31:01,962 : INFO : training on 2410740 raw words (570826 effective words) took 3.0s, 191072 effective words/s

复制代码

最后可视化评论分类效果，实现代码如下所示：

from matplotlib import pyplot as plt
label= data_bi['Class'].ravel()values = data_bi[['PosSim' , 'NegSim']].values
plt.scatter(values[:,0], values[:,1], c=label, alpha=0.4)plt.show()

复制代码

效果如下所示：

从上图中可以看到，好评与差评基本上可以通过一条直线区分开（蓝色为差评，红色为好评）

该方法与传统思路完全不同，没有使用词频率，情感词等特征，其优点有：

将数据集映射到了极低维度的空间，只有二维
一种无监督的学习方法，不需要对原始训练数据进行标注
具有普适性，在其他领域也可以用这种方法，只需要先找出该领域极其正和极其负的方法，将其与所有待识别的数据通过 doc2vec 转化为向量计算距离即可

发布于: 2021 年 04 月 07 日阅读数: 54

原文链接:【http://xie.infoq.cn/article/51c9a3a7c3c94a2fd1034a88a】。文章转载请联系作者。

不脱发的程序猿

关注

【研究方向】物联网、嵌入式、AI、Python 2018.02.09 加入

【公众号】美男子玩编程

发布

暂无评论

创作场景

自然语言处理：网购商品评论情感判定

1、项目背景

2、数据集

3、数据预处理

4、基于 SVM 的情感分类模型

5、基于 word2vec 中 doc2vec 的无监督分类模型

不脱发的程序猿

评论