Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2016, Vol. 52 ›› Issue (1): 155-164. DOI: 10.13209/j.0479-8023.2016.023
Turdi Tohti, Winira Musajan, Askar Hamdulla
吐尔地·托合提, 维尼拉·木沙江, 艾斯卡尔·艾木都拉
Abstract:
This paper proposes a method for automatic Uyghur text segmentation based on measuring the association strength between adjacent words. Word-based bigrams and contextual information are derived automatically from a large-scale raw corpus. Taking Uyghur word association rules into account, a linear combination of mutual information, difference of t-test, and dual adjacent entropy is used as a combined statistic (dmd) to estimate the association strength between two adjacent Uyghur words. Inter-word positions with weak association are taken as segmentation points, yielding word strings that are semantically and structurally complete rather than just the words separated by spaces. Experiments on a large-scale corpus show that the proposed algorithm achieves 88.21% segmentation accuracy.
Key words: semantic string, mutual information, difference of t-test, dual adjacent entropy, word association rules
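The segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, weights, or threshold, so pointwise mutual information alone stands in here for the combined statistic, and the names `train_counts`, `pmi`, `segment`, and the `threshold` parameter are hypothetical. The difference of t-test and dual adjacent entropy terms are omitted; any function that scores a pair of adjacent words could be passed as `score` in their place.

```python
import math
from collections import Counter


def train_counts(sentences):
    """Count unigrams and adjacent-word bigrams from space-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def pmi(x, y, unigrams, bigrams):
    """Pointwise mutual information of the adjacent pair (x, y).

    Returns -inf for a pair never seen in the corpus (assumed convention).
    """
    if bigrams[(x, y)] == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_xy = bigrams[(x, y)] / n_bi
    p_x = unigrams[x] / n_uni
    p_y = unigrams[y] / n_uni
    return math.log2(p_xy / (p_x * p_y))


def segment(words, score, threshold):
    """Cut between neighbours whose association score is below threshold.

    Strongly associated neighbours stay together in one word string.
    """
    if not words:
        return []
    chunks = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if score(prev, cur) < threshold:
            chunks.append([cur])      # weak association: start a new string
        else:
            chunks[-1].append(cur)    # strong association: extend the string
    return chunks


# Toy usage with placeholder tokens (not Uyghur data).
corpus = [["a", "b"], ["a", "b"], ["c", "d"]]
uni, bi = train_counts(corpus)
strings = segment(["a", "b", "c", "d"], lambda x, y: pmi(x, y, uni, bi), 0.0)
```

In this toy corpus the pair `("b", "c")` never occurs, so its score is negative infinity and a cut falls between `b` and `c`, while the observed pairs stay joined.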
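Abstract (Chinese version):
This paper proposes an automatic segmentation method for Uyghur text based on measuring the degree of association between words. The method automatically obtains Uyghur word bigrams and contextual information from a large-scale raw corpus. With full consideration of the combination rules between Uyghur words, it takes a linear fusion of the mutual information, the difference of t-test, and the dual adjacent entropy of adjacent words as a combined statistic (dmd) to measure the degree of association between adjacent words in the text. Inter-word positions that dmd measures as weakly associated serve as segmentation points for automatic segmentation, yielding word strings that are semantically and structurally complete rather than merely the words separated by spaces. Tests on a large-scale text corpus show that the segmentation accuracy of the method reaches 88.21%.
Key words (Chinese version): semantic string, mutual information, difference of t-test, adjacent pair entropy, word combination rules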
CLC Number: TP391
Turdi Tohti, Winira Musajan, Askar Hamdulla. Uyghur Text Automatic Segmentation Method Based on Inter-Word Association Degree Measuring[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2016, 52(1): 155-164.
吐尔地·托合提, 维尼拉·木沙江, 艾斯卡尔·艾木都拉. 基于词间关联度度量的维吾尔文本自动切分方法[J]. 北京大学学报(自然科学版), 2016, 52(1): 155-164.
URL: https://xbna.pku.edu.cn/EN/10.13209/j.0479-8023.2016.023
https://xbna.pku.edu.cn/EN/Y2016/V52/I1/155