Uyghur Text Automatic Segmentation Method Based on Inter-Word Association Degree Measuring

doi:10.13209/j.0479-8023.2016.023

Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, Vol. 52, Issue (1): 155-164.

Turdi Tohti, Winira Musajan, Askar Hamdulla

School of Information Science and Engineering, Xinjiang University, Urumqi 830046

Received: 2015-06-07 Online: 2016-01-20 Published: 2016-01-20
Corresponding author: Turdi Tohti, E-mail: turdy(at)xju.edu.cn
Funding:
Supported by the National Natural Science Foundation of China (61262062, 61163033, 61262063, 61562083) and the Key Project of the Scientific Research Program of Colleges and Universities of Xinjiang Uygur Autonomous Region (XJEDU2012I11)

Abstract:

This paper proposes an automatic segmentation method for Uyghur text based on measuring the degree of association between words. Word-based Bi-grams and contextual information are derived automatically from a large-scale raw corpus. Taking the Uyghur word association rules fully into account, a linear combination of the mutual information, difference of t-test, and dual adjacent entropy of adjacent words is used as a combined statistic (dmd) to estimate the association strength between two adjacent Uyghur words. Weakly associated inter-word positions, as measured by dmd, are taken as segmentation points, which yields word strings that are semantically and structurally complete rather than just words separated by spaces. Experimental results on a large-scale text corpus show that the proposed method achieves a segmentation accuracy of 88.21%.

Key words: semantic string, mutual information, difference of t-test, dual adjacent entropy, word association rules
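The general idea of cutting a sentence at weakly associated inter-word positions can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the dmd formula, the fusion weights, or the threshold, so all of these are assumptions here. Only pointwise mutual information is computed; the difference of t-test and dual adjacent entropy would enter as further components of the fused score. The function names (`train_counts`, `pmi`, `fuse`, `segment`) and the `threshold` parameter are hypothetical.

```python
import math
from collections import Counter


def train_counts(sentences):
    """Count unigrams and adjacent-word bigrams in space-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = sent.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def pmi(x, y, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (x, y).

    Returns -inf when the pair (or either word) was never observed.
    """
    if bigrams[(x, y)] == 0 or unigrams[x] == 0 or unigrams[y] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[(x, y)] / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def fuse(scores, weights):
    """Linear fusion of component association scores (weights are assumed)."""
    return sum(w * s for w, s in zip(weights, scores))


def segment(sentence, unigrams, bigrams, threshold=0.0):
    """Split a sentence wherever the fused association score is below threshold."""
    words = sentence.split()
    if not words:
        return []
    chunks, current = [], [words[0]]
    for x, y in zip(words, words[1:]):
        # Only PMI is used here; other statistics would be appended to this
        # list, with matching entries in the weight list.
        score = fuse([pmi(x, y, unigrams, bigrams)], [1.0])
        if score < threshold:
            chunks.append(" ".join(current))  # weak association: cut here
            current = [y]
        else:
            current.append(y)  # strong association: keep words together
    chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    # Toy placeholder corpus (not Uyghur data, purely illustrative).
    corpus = ["new york is big", "new york is far",
              "he is in new york", "she is big"]
    uni, bi = train_counts(corpus)
    print(segment("new york is big", uni, bi, threshold=2.0))
```

In this sketch an unseen word pair gets a score of negative infinity and is therefore always a cut point; a real system would more likely smooth the counts and tune the weights and threshold on held-out data.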

CLC number:

TP391

Turdi Tohti, Winira Musajan, Askar Hamdulla. Uyghur Text Automatic Segmentation Method Based on Inter-Word Association Degree Measuring[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1): 155-164.

Link to this article: https://xbna.pku.edu.cn/CN/10.13209/j.0479-8023.2016.023

https://xbna.pku.edu.cn/CN/Y2016/V52/I1/155

