北京大学学报(自然科学版)

一种基于流形距离的中文语块聚类分析方法

雷霖1,熊伟1,景宁1,肖建夫2   

  1. 国防科学技术大学电子科学与工程学院, 长沙 410073; 2. 长江日报报业集团, 武汉 430015
  • 收稿日期:2012-05-31 出版日期:2013-01-20 发布日期:2013-01-20

A Clustering Chunking Method Based on Manifold Geodesic Distance

LEI Lin1, XIONG Wei1, JING Ning1, XIAO Jianfu2   

  1. College of Electronic Science and Engineering, National University of Defense Technology, Changsha 410073; 2. Changjiang Daily News Group, Wuhan 430015
  • Received:2012-05-31 Online:2013-01-20 Published:2013-01-20

摘要: 将中文语块分析看做词在句子内部聚类并标记语块类别的过程, 建立了中文语块分析的聚类模型。首先构建词的语法功能空间, 使用ISOMAP方法重构词空间的低维流形嵌入, 进而考察词在低维空间中的分布情况。在使用层次聚类方法分析语块时, 使用流形上的距离替代传统的欧氏距离, 在算法复杂度可以接受的范围内, 提高了语块分析效果。

关键词: 语块分析, 流形距离, 层次聚类, 语法功能空间

Abstract: Chinese chunk analysis is treated as a process of clustering words within a sentence and labeling the resulting chunks by type, and a clustering model for Chinese chunk analysis is established. A grammatical function space of words is constructed first and then embedded in a lower-dimensional space using ISOMAP, so that the distribution of Chinese words in the embedding space can be examined. In the hierarchical clustering step that partitions words into chunks, the manifold geodesic distance replaces the conventional Euclidean distance as the measure of similarity between words. This improves chunk analysis performance while keeping the algorithmic complexity within an acceptable range.

Key words: chunker analysis, manifold geodesic distance, hierarchical clustering, grammar function space
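
The core idea in the abstract, replacing Euclidean distance with a distance measured along the manifold before hierarchical clustering, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy 2-D points stand in for word vectors in the grammatical function space, the neighbourhood size `k = 2` and the average-linkage rule are arbitrary choices, and the helper names `geodesic_distances` and `agglomerative` are hypothetical. The sketch covers only the neighbourhood-graph shortest-path step that ISOMAP uses to approximate geodesic distance; the low-dimensional embedding step and chunk type labeling are omitted.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)


def geodesic_distances(points, k=2):
    """Approximate manifold distances: k-nearest-neighbour graph + Floyd-Warshall."""
    n = len(points)
    d = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0.0
        # Connect i to its k nearest neighbours with symmetric edges.
        nearest = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: euclidean(points[i], points[j]),
        )[:k]
        for j in nearest:
            d[i][j] = d[j][i] = euclidean(points[i], points[j])
    # All-pairs shortest paths along the neighbourhood graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d


def agglomerative(dist, n_clusters):
    """Average-linkage hierarchical clustering on a precomputed distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy data (assumed): ten points on an open arc, 30 degrees apart.
points = [
    (math.cos(math.radians(30 * i)), math.sin(math.radians(30 * i)))
    for i in range(10)
]
geo = geodesic_distances(points, k=2)
clusters = agglomerative(geo, n_clusters=2)
```

On this arc the two end points are close in a straight line but far apart along the curve, so `geo[0][9]` is much larger than `euclidean(points[0], points[9])`. That gap is the kind of difference a manifold distance is meant to capture when the data do not lie in a flat space.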
