
Improved K-means Short Text Clustering Algorithm Combined with Semantics

Computer Engineering and Applications 计算机工程与应用

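2016, 52(19)

1 Introduction

Short text data are pervasive in widely used systems such as WeChat, microblogs, and forums. While conveying public information, short texts also carry rich user information, which makes them a new and highly valuable information resource. Clustering such data can be applied effectively to hot-topic discovery, user sentiment analysis, personalized product recommendation, and similar areas.

Because short texts carry little feature information and only a few words are available for analysis, short text clustering [1-2] has been studied in considerable depth both in China and abroad. Reference [3] obtains the feature keywords of short texts through entropy and text word clusters; compared with traditional feature extraction based on term frequency, it markedly improves the quality of the extracted feature keywords. For short texts whose feature keywords are very sparse, however, the method cannot achieve satisfactory clustering results. Reference [4] clusters short texts by modeling term frequency, and a model trained on large data sets can achieve good clustering results. Reference [5] proposes a discrete particle swarm optimization algorithm that clusters small short-text corpora well, but the F-measure of its clustering results drops markedly as the corpus grows. Reference [6] proposes an immunity-based Chinese network …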

邱云飞，赵彬，林明明，王伟

QIU Yunfei, ZHAO Bin, LIN Mingming, WANG Wei

辽宁工程技术大学软件学院，辽宁葫芦岛125105

School of Software, Liaoning Technical University, Huludao, Liaoning 125105, China

QIU Yunfei, ZHAO Bin, LIN Mingming, et al. Improved K-means clustering algorithm combined semantic similarity of short text. Computer Engineering and Applications, 2016, 52(19): 78-83.

Abstract: Short text clustering faces three major challenges: the sparsity of feature keywords, the complexity of processing in high-dimensional space, and the comprehensibility of clusters. To address them, a K-means short text clustering algorithm improved by combining semantics is proposed. The algorithm represents each short text as a set of words, which alleviates the sparsity of short text feature keywords. It obtains the initial cluster centers by mining the maximum frequent word sets of the short text collection, which effectively overcomes the sensitivity of the K-means algorithm to the initial cluster centers and addresses the comprehensibility of clusters. It computes the similarity between documents with a semantic similarity combined with TF-IDF values, which avoids operations in high-dimensional space. Experimental results show that the short text clustering algorithm built from a semantic perspective outperforms traditional short text clustering algorithms.

Key words: text mining; clustering of short text; K-means algorithm; maximum frequent word set; HowNet; semantic similarity
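The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy corpus, the `TOY_SIM` table (a hypothetical stand-in for a HowNet word similarity), the smoothed IDF, and the weighted-average scoring rule in `doc_center_sim` are all assumptions made for illustration. The sketch covers only seeding from maximum frequent word sets and a single assignment pass, because the center-update rule is not described in this excerpt.

```python
from itertools import combinations
from math import log

# Toy corpus (assumed): each short text is already segmented into a word set.
docs = [
    {"phone", "battery", "screen"},
    {"phone", "battery", "charger"},
    {"phone", "battery", "price"},
    {"movie", "actor", "review"},
    {"movie", "actor", "ticket"},
    {"film", "director", "plot"},
]

# Hypothetical stand-in for a HowNet-style word similarity in [0, 1].
TOY_SIM = {frozenset({"movie", "film"}): 0.9}


def word_sim(a, b):
    """Return 1.0 for identical words, a table value if listed, else 0.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset({a, b}), 0.0)


def maximal_frequent_sets(docs, min_support):
    """Brute-force level-wise search; only suitable for a toy vocabulary."""
    vocab = sorted(set().union(*docs))
    frequent = []
    for size in range(1, len(vocab) + 1):
        level = [
            frozenset(c)
            for c in combinations(vocab, size)
            if sum(1 for d in docs if set(c) <= d) >= min_support
        ]
        if not level:
            break  # no frequent set of this size, so none of a larger size
        frequent.extend(level)
    # A set is maximal when no frequent proper superset exists.
    return [s for s in frequent if not any(s < t for t in frequent)]


def tfidf_weights(doc, docs):
    """TF-IDF weight per word of a word-set document (smoothed IDF, assumed)."""
    n = len(docs)
    weights = {}
    for w in doc:
        df = sum(1 for d in docs if w in d)
        idf = log((1 + n) / (1 + df)) + 1.0  # always positive
        weights[w] = idf / len(doc)          # each word occurs once in a set
    return weights


def doc_center_sim(doc, center, docs):
    """TF-IDF-weighted average of each word's best match in the center."""
    weights = tfidf_weights(doc, docs)
    score = sum(wt * max(word_sim(w, c) for c in center)
                for w, wt in weights.items())
    return score / sum(weights.values())


def assign(docs, centers):
    """Assign every document to the index of its most similar center."""
    return [
        max(range(len(centers)),
            key=lambda k: doc_center_sim(d, centers[k], docs))
        for d in docs
    ]


centers = maximal_frequent_sets(docs, min_support=2)
labels = assign(docs, centers)
print(centers)
print(labels)
```

In this toy run the last document shares no word with either center and is grouped with the movie texts only through the assumed similarity between "film" and "movie", which is the kind of match a purely lexical comparison would miss. The word-set centers also stay directly readable, and no high-dimensional vector is ever built.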

Document code: A    CLC number: TP391.1    doi: 10.3778/j.issn.1002-8331.1412-0418

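Foundation items: National Natural Science Foundation of China (No.71371091); Growth Program for Outstanding Young Scholars of Liaoning Province Higher Education Institutions (No.LJQ2012027); General Project of the Education Department of Liaoning Province (No.L2013131).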

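About the authors: QIU Yunfei (born 1976), male, Ph.D., professor, CCF member; main research areas: data mining and sentiment analysis. ZHAO Bin (born 1990), corresponding author, male, master's student; main research areas: data mining, text mining, and big data mining; E-mail: binsanity@… . LIN Mingming (born 1989), female, master's student; main research areas: data mining and sentiment analysis. WANG Wei (born 1981), male, lecturer; main research areas: data mining, algorithm design and analysis, etc.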

Received: 2015-01-06    Revised: 2015-03-17    Article ID: 1002-8331(2016)19-0078-06

CNKI online-first publication: 2015-06-24, …/kcms/detail/11.2127.TP.20150624.1129.028.html
