Acta Scientiarum Naturalium Universitatis Pekinensis

A Parallel Training Research of Chinese Part-of-Speech Tagging CRF Model Based on MapReduce

LIU Tao, LEI Lin, CHEN Luo, XIONG Wei

  1. College of Electronic Science and Engineering, National University of Defense Technology, Changsha 410073
  • Received: 2012-05-30 Online: 2013-01-20 Published: 2013-01-20

Abstract: Conventional training algorithms for conditional random field (CRF) models run on a single machine and perform poorly on large-scale data. To address this problem, a parallel CRF training method based on the MapReduce framework is proposed. Parallel algorithms are designed for CRF feature extraction and parameter estimation, which together give a parallel implementation of the iterative scaling algorithm. Experiments show that the proposed method preserves the correctness of the training results while substantially reducing training time.

Key words: part-of-speech (POS) tagging, conditional random field (CRF), MapReduce, parallel

CLC number:
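The abstract does not give the algorithm itself, so the following is only a minimal sketch of how the feature extraction step might map onto MapReduce: each map task counts features within its own shard of tagged sentences, and the reduce step sums the per-shard counts. These empirical counts are one of the quantities iterative scaling needs. Everything here is an assumption for illustration: the feature templates (`emit:` and `trans:`), the function names, the toy sentences, and the in-process simulation used in place of a real MapReduce cluster.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List, Tuple

# A sentence is a list of (word, POS tag) pairs.
Sentence = List[Tuple[str, str]]


def extract_features(sentence: Sentence) -> Iterable[str]:
    """Hypothetical templates: word/tag emission and tag-bigram transition."""
    prev_tag = "<BOS>"
    for word, tag in sentence:
        yield f"emit:{word}|{tag}"
        yield f"trans:{prev_tag}|{tag}"
        prev_tag = tag


def map_shard(shard: List[Sentence]) -> Counter:
    """Map phase: count features locally within one data shard."""
    counts: Counter = Counter()
    for sentence in shard:
        counts.update(extract_features(sentence))
    return counts


def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reduce phase: merge two partial results by summing counts per feature."""
    return a + b


def empirical_counts(shards: List[List[Sentence]]) -> Counter:
    """Simulate the job in-process: map every shard, then fold the results."""
    return reduce(reduce_counts, map(map_shard, shards), Counter())


# Toy corpus split into two shards (tags are illustrative).
shards = [
    [[("我", "PN"), ("爱", "VV"), ("北京", "NR")]],
    [[("我", "PN"), ("去", "VV"), ("北京", "NR")]],
]
counts = empirical_counts(shards)
```

Because the reduce step is a plain sum, it is associative and commutative, so partial counts can be combined in any order. That property is what makes this step straightforward to distribute.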