Acta Scientiarum Naturalium Universitatis Pekinensis

Tibetan Number Identification and Translation

SUN Meng1,2, HUA Quecairang3, LIU Kai1, LÜ Yajuan1, LIU Qun1

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. Graduate University, Chinese Academy of Sciences, Beijing 100049; 3. Tibetan Information Research Center, Qinghai Normal University, Xining 810008
  • Received: 2012-06-05 Online: 2013-01-20 Published: 2013-01-20

Abstract: By analyzing the internal word-formation rules and the external boundary information of Tibetan numbers, the authors propose a definition of the basic components of Tibetan numbers. A best-path decision model determines the boundaries of these components, a finite automaton then recognizes and translates basic numbers, and a template-matching algorithm finally handles complex numbers. The proposed method achieves an F-score of 98.73% for number identification and translation, and it improves the BLEU score on a Tibetan-Chinese machine translation test set by 2.64%.

Key words: Tibetan, number basic component, automaton, number identification, number translation
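
The finite-automaton step named in the abstract can be illustrated with a minimal sketch. It is not the authors' model: the component inventory, states, and transitions below are assumptions made for illustration. Tokens are given in Wylie transliteration, the vocabulary covers only the units 1 to 9 and the word for ten, and the irregular contracted forms and connective particles of real Tibetan numbers are ignored. The names `step` and `find_numbers` are hypothetical.

```python
# Toy vocabulary (assumption): unit words in Wylie transliteration.
UNITS = {
    "gcig": 1, "gnyis": 2, "gsum": 3, "bzhi": 4, "lnga": 5,
    "drug": 6, "bdun": 7, "brgyad": 8, "dgu": 9,
}
TEN = "bcu"  # word for ten

# Automaton states: nothing read, a unit read, a tens part read, tens plus unit read.
START, UNIT, TENS, FULL = range(4)
ACCEPTING = {UNIT, TENS, FULL}


def step(state, value, token):
    """Return (next_state, new_value), or None if no transition exists."""
    if token in UNITS:
        digit = UNITS[token]
        if state == START:
            return UNIT, digit           # e.g. "lnga" -> 5
        if state == TENS:
            return FULL, value + digit   # e.g. "bcu gcig" -> 11
    elif token == TEN:
        if state == START:
            return TENS, 10              # bare "bcu" -> 10
        if state == UNIT:
            return TENS, value * 10      # e.g. "bzhi bcu" -> 40
    return None


def find_numbers(tokens):
    """Scan a token list and return (start, end, value) for each longest match."""
    results = []
    i = 0
    while i < len(tokens):
        state, value = START, 0
        best = None
        j = i
        while j < len(tokens):
            nxt = step(state, value, tokens[j])
            if nxt is None:
                break
            state, value = nxt
            j += 1
            if state in ACCEPTING:
                best = (i, j, value)
        if best is not None:
            results.append(best)
            i = best[1]   # resume after the recognized number
        else:
            i += 1        # not a number word; move on
    return results
```

Because each transition carries the running value along with the state, recognition and translation happen in the same pass. For example, `find_numbers(["w1", "bzhi", "bcu", "w2"])` reports one number spanning tokens 1 to 3 with value 40, where `w1` and `w2` stand for arbitrary non-number tokens.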
