Acta Scientiarum Naturalium Universitatis Pekinensis, 2019, Vol. 55, Issue (1): 47-54. DOI: 10.13209/j.0479-8023.2018.067


字符级的维吾尔语形态协同分析方法

吐尔洪·吾司曼1,2,3, 杨雅婷1,2,3, 艾孜孜·吐尔逊4, 程力1,2,3,†

    1. 中国科学院新疆理化技术研究所, 乌鲁木齐 830011
    2. 中国科学院大学, 北京 100049
    3. 新疆民族语音语言信息处理实验室, 乌鲁木齐 830011
    4. 和田师范专科学校数学与信息学院, 和田 848000
  • 收稿日期:2018-04-15 修回日期:2018-08-09 出版日期:2019-01-20 发布日期:2019-01-20
  • 通讯作者: 程力, E-mail: chengli(at)ms.xjb.ac.cn
  • 基金资助:
    中国科学院“西部之光”人才培养计划基金(2017-XBZG-BR-001, 2017-XBQNZ-A-005)、国家“千人计划”项目(Y32H251201)、国家自然科学基金(U1703133)和中国科学院青年创新促进会基金(2017472)资助

Collaborative Analysis of Uyghur Morphology Based on Character Level

Turghun Osman1,2,3, YANG Yating1,2,3, Eziz Tursun4, CHENG Li1,2,3,†   

    1. Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011
    2. University of Chinese Academy of Sciences, Beijing 100049
    3. Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011
    4. Institute of Mathematics and Information of Hotan Teachers College, Hotan 848000
  • Received:2018-04-15 Revised:2018-08-09 Online:2019-01-20 Published:2019-01-20
  • Contact: CHENG Li, E-mail: chengli(at)ms.xjb.ac.cn
  • Supported by:
    The "West Light" Talent Training Program of the Chinese Academy of Sciences (2017-XBZG-BR-001, 2017-XBQNZ-A-005), the national "Thousand Talents Program" (Y32H251201), the National Natural Science Foundation of China (U1703133), and the Youth Innovation Promotion Association of the Chinese Academy of Sciences (2017472)

摘要:

针对维吾尔语中构形词缀种类多、构形复杂以及发生音变现象等问题, 提出一种基于字符级的维吾尔语形态协同分析方法。该方法最大的特点是同时进行维吾尔语的形态切分、形态标注以及音变还原, 将词素边界、形态标记以及音变信息用一个复合标记描述, 采用字符序列的标注方法进行训练。实验结果显示, 形态切分、形态标注及音变还原的正确率分别达到96.39%, 92.78%和99.79%, 系统总体正确率达92.59%。

关键词: 维吾尔语, 形态分析, 协同分析

Abstract:

Uyghur has many kinds of inflectional affixes, complex inflectional structure, and phonetic changes at morpheme junctions. To address these problems, the authors propose a character-level collaborative analysis method for Uyghur morphology. Its main characteristic is that morpheme segmentation, morphological annotation, and restoration of phonetic changes are performed simultaneously: morpheme boundaries, morphological tags, and phonetic-change information are described by a single composite tag, and the model is trained as a character sequence labeling task. Experimental results show that the accuracy of morpheme segmentation, morphological annotation, and phonetic-change restoration reaches 96.39%, 92.78%, and 99.79%, respectively, and the overall accuracy of the system reaches 92.59%.

Key words: Uyghur, morphological analysis, collaborative analysis
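
The composite-tag idea described in the abstract can be illustrated with a minimal sketch. The tag layout (`boundary|label|restore`), the label names `N` and `POSS3`, and the helper names `CharTag` and `decode` are assumptions made for this illustration, not the tag set or implementation used in the paper. The example word is the Latin-script form balisi ("his/her child"), from bala + si with the stem-final vowel weakened. The sketch only shows how one tag per character can carry all three kinds of information; the sequence labeling model that would predict these tags is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CharTag:
    """Hypothetical composite tag attached to one surface character."""
    boundary: str  # "B" = first character of a morpheme, "I" = inside a morpheme
    label: str     # morphological label of the morpheme this character belongs to
    restore: str   # underlying character before the phonetic change, "" if unchanged

    def encode(self) -> str:
        # Flatten the three fields into one label for a sequence labeler.
        return f"{self.boundary}|{self.label}|{self.restore or '-'}"


def decode(chars: str, tags: List[CharTag]) -> List[Tuple[str, str]]:
    """Rebuild (restored morpheme, label) pairs from per-character tags."""
    morphemes: List[List[str]] = []
    for ch, tag in zip(chars, tags):
        underlying = tag.restore or ch  # undo the phonetic change if one is marked
        if tag.boundary == "B" or not morphemes:
            morphemes.append([underlying, tag.label])  # start a new morpheme
        else:
            morphemes[-1][0] += underlying             # extend the current morpheme
    return [(form, label) for form, label in morphemes]


# Illustrative analysis of "balisi" = bala (stem) + si (3rd person possessive).
word = "balisi"
tags = [
    CharTag("B", "N", ""),
    CharTag("I", "N", ""),
    CharTag("I", "N", ""),
    CharTag("I", "N", "a"),      # surface "i" is restored to underlying "a"
    CharTag("B", "POSS3", ""),
    CharTag("I", "POSS3", ""),
]
analysis = decode(word, tags)
print([t.encode() for t in tags])
print(analysis)
```

Because segmentation, annotation, and restoration are all read off the same tag sequence, a single character-level labeler can produce the three outputs in one pass, which is the property the abstract emphasizes.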