Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2020, Vol. 56 ›› Issue (1): 97-104. DOI: 10.13209/j.0479-8023.2019.098

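Neural Network Coupled Model for Conversion and Exploitation of Heterogeneous Lexical Annotations

HUANG Depeng, LI Zhenghua, GONG Chen, ZHANG Min

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006
  • Received: 2019-05-22 Revised: 2019-09-19 Online: 2020-01-20 Published: 2020-01-20
  • Contact: LI Zhenghua, E-mail: zhli13(at)suda.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61525205, 61876116, 61702518) and the Priority Academic Program Development of Jiangsu Higher Education Institutions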

Abstract:

To enlarge the pool of manually annotated data and thereby improve model performance, we attempt to make full use of existing heterogeneous annotations when learning model parameters. We extend the coupled sequence labeling model proposed by Li et al. (2015) to a BiLSTM-based deep learning framework. The neural coupled model learns its parameters directly from two heterogeneous training datasets and predicts two optimal tag sequences simultaneously at test time. Extensive experiments are conducted on part-of-speech (POS) tagging and on joint word segmentation and POS (WS&POS) tagging. The results show that the neural coupled approach exploits heterogeneous lexical data better than other methods, including multi-task learning and the traditional discrete-feature coupled model. It achieves higher performance in both scenarios: annotation conversion, and boosting target-side tagging accuracy by exploiting heterogeneous data.

Key words: coupled model, BiLSTM, deep learning, part-of-speech tagging, word segmentation
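
The abstract does not spell out how one model can learn from two differently annotated datasets and still output two tag sequences. One common reading of coupled tagging, assumed here and not confirmed by the text above, is that each token receives a bundled tag pairing one tag from each scheme, so a token annotated under only one scheme is compatible with a set of bundled tags. The minimal Python sketch below illustrates that idea only; the tag sets, the function names `allowed_bundles` and `decode`, and the token-wise argmax are hypothetical stand-ins, and the BiLSTM scorer is replaced by hand-written scores.

```python
from itertools import product

# Hypothetical toy tag sets for two annotation schemes (not taken from the paper).
TAGS_A = ["NN", "VV"]
TAGS_B = ["n", "v", "vn"]

# Bundled tag space: every (tag_a, tag_b) pair.
BUNDLED = list(product(TAGS_A, TAGS_B))


def allowed_bundles(tag, side):
    """Return the bundled tags compatible with a single-side gold tag.

    side is "a" or "b". The other side is left unconstrained, so a token
    annotated under one scheme only maps to an ambiguous set of bundles.
    """
    idx = 0 if side == "a" else 1
    return [b for b in BUNDLED if b[idx] == tag]


def decode(scores):
    """Pick the best bundled tag per token, then split into two sequences.

    scores: one dict per token, mapping bundled tag -> score. In the
    assumed setup these would come from a neural encoder; here they are
    plain numbers. Token-wise argmax is a simplification of sequence-level
    decoding.
    """
    best = [max(BUNDLED, key=lambda b: tok.get(b, float("-inf"))) for tok in scores]
    seq_a = [a for a, _ in best]
    seq_b = [b for _, b in best]
    return seq_a, seq_b
```

In this sketch, training on a sentence labeled under scheme A would restrict each token to `allowed_bundles(tag, "a")`, while decoding always yields both sequences, which matches the abstract's description of predicting two tag sequences at once.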