北京大学学报(自然科学版)

适用于大规模文本处理的动态密度聚类算法

李霞1,2,蒋盛益2,张倩生2,朱靖2   

  1. 广东外语外贸大学外国语言学及应用语言学研究中心, 广州 510420; 2. 广东外语外贸大学思科信息学院, 广州 510006
  • 收稿日期:2012-06-06 出版日期:2013-01-20 发布日期:2013-01-20

A Dynamic Density-Based Clustering Algorithm Appropriate to Large-Scale Text Processing

LI Xia1,2, JIANG Shengyi2, ZHANG Qiansheng2, ZHU Jing2   

  1. National Key Research Center for Linguistics and Applied Linguistics, Guangdong University of Foreign Studies, Guangzhou 510420; 2. Cisco School of Informatics, Guangdong University of Foreign Studies, Guangzhou 510006
  • Received:2012-06-06 Online:2013-01-20 Published:2013-01-20

摘要: 针对传统的基于密度的聚类算法对海量数据处理时, 存在参数输入复杂及时间复杂度高的问题, 给出新的密度定义方法, 并在此基础上提出一种只需一个简单输入参数就能动态识别密度不均匀聚类簇的聚类算法, 同时将其扩充为可以处理海量数据的两阶段动态密度聚类算法。在人造数据集、大规模数据集以及中英文文本语料数据集上的实验表明, 所提出的算法具有输入参数简单和聚类效率高的特点, 可以应用于海量文本数据的聚类处理。

关键词: 文本挖掘, 聚类, 海量数据, 动态密度

Abstract: Traditional density-based clustering algorithms suffer from complicated parameter settings and high time complexity when applied to massive data. To address this, a new density definition is proposed, and on this basis a clustering algorithm is presented that needs only one simple input parameter and can dynamically identify clusters of non-uniform density. The algorithm is further extended to a two-stage dynamic density-based clustering algorithm (TSDDBCA) that can process massive data. Experiments on synthetic datasets, large-scale datasets from UCI, and English and Chinese text corpora show that TSDDBCA offers simple parameter setting and high clustering efficiency, and that it can be applied to clustering large-scale text data.

Key words: text mining, clustering, large-scale data, dynamic density
