Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2016, Vol. 52 ›› Issue (1): 155-164. DOI: 10.13209/j.0479-8023.2016.023


Uyghur Text Automatic Segmentation Method Based on Inter-Word Association Degree Measuring

Turdi Tohti, Winira Musajan, Askar Hamdulla

  1. School of Information Science and Engineering, Xinjiang University, Urumqi 830046
  • Received: 2015-06-07 Online: 2016-01-20 Published: 2016-01-20
  • Contact: Turdi Tohti, E-mail: turdy(at)xju.edu.cn
  • Funding:
    Supported by the National Natural Science Foundation of China (61262062, 61163033, 61262063, 61562083) and the Key Project of the Scientific Research Program of Colleges and Universities of Xinjiang Uygur Autonomous Region (XJEDU2012I11)

Abstract:

This paper proposes an automatic segmentation method for Uyghur text based on measuring the degree of association between words. Word-level bi-grams and contextual information are acquired automatically from a large-scale raw corpus. Taking the combination rules between Uyghur words into account, the method uses a linear combination of mutual information, difference of t-test, and dual adjacent entropy of adjacent words as a combined statistic (dmd) that measures the association strength between two adjacent Uyghur words. Inter-word positions that dmd scores as weakly associated are taken as segmentation points, so that segmentation yields word strings that are semantically and structurally complete, rather than merely the words separated by spaces. Experiments on a large-scale text corpus show that the method achieves a segmentation accuracy of 88.21%.

Key words: semantic string, mutual information, difference of t-test, dual adjacent entropy, word association rules
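
The pipeline summarized in the abstract (score each gap between adjacent words with a combined statistic, then cut where the score is weak) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `bigram_pmi`, `dmd` and `segment`, the equal default weights, the sign of each term, and the threshold are all assumptions made for illustration. Because the abstract does not give formulas for the difference of t-test or the dual adjacent entropy, those two values are passed in as precomputed numbers, and only a plain pointwise mutual information estimate is computed here.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Pointwise mutual information (log base 2) for each adjacent word pair.

    Probabilities are simple relative frequencies over one token list;
    a real system would estimate them from a large raw corpus.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi


def dmd(mi, dts, entropy, weights=(1.0, 1.0, 1.0)):
    """Linear fusion of the three association statistics.

    The weights and the positive sign of every term are assumptions;
    the abstract only states that the combination is linear.
    """
    a, b, c = weights
    return a * mi + b * dts + c * entropy


def segment(words, gap_scores, threshold):
    """Cut the word sequence at every gap whose score is below threshold.

    gap_scores[i] is the association score between words[i] and words[i + 1].
    Returns the resulting word strings, each joined by single spaces.
    """
    if not words:
        return []
    if len(gap_scores) != len(words) - 1:
        raise ValueError("need exactly one score per gap between adjacent words")
    chunks = [[words[0]]]
    for word, score in zip(words[1:], gap_scores):
        if score < threshold:
            chunks.append([word])    # weak association: start a new string
        else:
            chunks[-1].append(word)  # strong association: extend the current string
    return [" ".join(chunk) for chunk in chunks]
```

With the placeholder tokens `["w1", "w2", "w3", "w4"]`, gap scores `[2.0, -1.0, 3.0]` and a threshold of `0.0`, `segment` keeps the first and last pairs together and cuts at the middle gap, returning `["w1 w2", "w3 w4"]`. In practice the weights and threshold would have to be tuned on held-out data.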
