Acta Scientiarum Naturalium Universitatis Pekinensis

Conversion of Multiple Resources for POS Tagging

GAO Enting1, CHAO Jiayuan2, LI Zhenghua2

  1. College of Electronics & Information Engineering, Suzhou University of Science and Technology, Suzhou 215011; 2. School of Computer Science & Technology, Soochow University, Suzhou 215006
  • Received: 2014-06-30 Online: 2015-03-20 Published: 2015-03-20

Abstract: The authors propose an annotation conversion method that exploits multiple resources for POS tagging. The method converts source-side annotations so that they conform to the target-side annotation guideline, then merges the converted data with the target-side data to enlarge the training set. Two new strategies are proposed. The first additionally uses the confidence information of guide features during conversion. The second represents the converted data with ambiguous labelings in order to reduce annotation errors. Experiments show that the confidence information effectively helps conversion, whereas the ambiguous labeling representation has little effect.

Key words: annotation conversion, conditional random field, POS tagging
