Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, Vol. 58, Issue (1): 91-98. DOI: 10.13209/j.0479-8023.2021.106

An Automatic Classification Model of Tibetan La Case Example Sentences with Fusion Dual-channel Syllable Features

BAN Mabao1,2,3, CAI Rangjia1,2,3,4,5,†, ZHANG Rui1,2,3, SE Chajia1,2,3, ZHUO Mazhaxi1,2,3

  1. College of Computer Science and Technology, Qinghai Normal University, Xining 810016
  2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008
  3. Tibetan Information Processing Engineering Technology and Research Center of Qinghai Province, Xining 810008
  4. Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Xining 810008
  5. Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining 810008
  • Received: 2021-06-12 Revised: 2021-08-07 Online: 2022-01-20 Published: 2022-01-20
  • Contact: CAI Rangjia, E-mail: zwxxzx(at)163.com
  • Supported by:
    the National Natural Science Foundation of China (61662061, 61063033, 61966031), the National Key Research and Development Program of China (2017YFB1402200), a project of the Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province (2020-ZJ-Y05), a project of the Qinghai Provincial Department of Science and Technology (2019-SF-129), and Qinghai Provincial Key Laboratory projects (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03)

Abstract:

Automatic classification of Tibetan La case example sentences is important for Tibetan natural language processing. Based on the usage and attachment rules of the Tibetan La case, this paper first classifies La case example sentences and defines each category, then proposes an automatic classification model that fuses dual-channel syllable features. The model first uses word2vec and GloVe to build dual-channel Tibetan syllable embeddings and fuses the two channels of syllable features in each convolution path, which enriches the input representation and improves the spatial representation ability of the convolutional layers. Next, in each convolution path, a Bi-LSTM combined with a hierarchical attention mechanism learns temporal features, and the features from the multiple paths are concatenated to improve the learning of contextual temporal features. Finally, a fully connected layer and a Softmax layer perform the classification. Experiments show that the proposed model reaches a classification accuracy of 90.26% on the test set.

Key words: natural language processing, dual-channel syllable features, Tibetan La case example sentences, automatic classification
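
The data flow described in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the embedding tables are random stand-ins for pretrained word2vec and GloVe vectors, all weights are untrained, the Bi-LSTM with hierarchical attention is replaced by max-over-time pooling to keep the example short, and the sizes `DIM`, `FILTERS`, `WIDTHS`, and `NUM_CLASSES` as well as the example sentence are assumptions chosen only for illustration. The one language fact it relies on is that Tibetan syllables are delimited by the tsheg mark (U+0F0B).

```python
import math
import random

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause/sentence-final mark
# Assumed toy sizes; the abstract does not state the real hyperparameters.
DIM, FILTERS, WIDTHS, NUM_CLASSES = 8, 4, (2, 3), 4
SENTENCE = "ཁོ་ལྷ་ས་ལ་འགྲོ།"  # illustrative sentence containing the particle ལ


def split_syllables(sentence):
    """Split a Tibetan sentence into syllables on the tsheg mark."""
    return [s for s in sentence.strip(SHAD).split(TSHEG) if s]


def random_table(vocab, seed):
    """Random stand-in for a pretrained syllable embedding table."""
    rng = random.Random(seed)
    return {s: [rng.uniform(-1, 1) for _ in range(DIM)] for s in sorted(vocab)}


def conv_path(pairs, width, rng):
    """One convolution path: every filter spans both embedding channels."""
    features = []
    for _ in range(FILTERS):
        # kernel[k][c] is the weight vector for window offset k and channel c
        kernel = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(2)]
                  for _ in range(width)]
        best = 0.0  # starting at 0 makes the running max equal ReLU + max pooling
        for i in range(len(pairs) - width + 1):
            total = sum(x * w
                        for k in range(width)
                        for c in range(2)  # fuse the two channels in one filter
                        for x, w in zip(pairs[i + k][c], kernel[k][c]))
            best = max(best, total)
        features.append(best)
    return features


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def classify(sentence, seed=0):
    """Return class probabilities from the untrained toy pipeline."""
    rng = random.Random(seed)
    syllables = split_syllables(sentence)
    table_a = random_table(set(syllables), seed + 1)  # stand-in for word2vec
    table_b = random_table(set(syllables), seed + 2)  # stand-in for GloVe
    pairs = [(table_a[s], table_b[s]) for s in syllables]  # dual-channel input
    # Concatenate the pooled features of all convolution paths.
    fused = [f for w in WIDTHS for f in conv_path(pairs, w, rng)]
    # Fully connected layer followed by softmax.
    dense = [[rng.uniform(-1, 1) for _ in fused] for _ in range(NUM_CLASSES)]
    return softmax([sum(w * f for w, f in zip(row, fused)) for row in dense])


probs = classify(SENTENCE)
print(len(split_syllables(SENTENCE)), [round(p, 3) for p in probs])
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how two embedding channels enter each convolution path and how the per-path features are concatenated before the fully connected and Softmax layers. A faithful reproduction would load trained embeddings and insert a Bi-LSTM with attention between the convolution and the concatenation.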