Acta Scientiarum Naturalium Universitatis Pekinensis

Error Detection for Statistical Machine Translation Based on Feature Comparison and Maximum Entropy Model Classifier

DU Jinhua, WANG Sha   

  1. Faculty of Automation and Information Engineering, Xi’an University of Technology, Xi’an 710048;
  • Received: 2012-05-29 Online: 2013-01-20 Published: 2013-01-20

基于特征比较和最大熵模型的统计机器翻译错误检测

杜金华,王莎   

  1. 西安理工大学自动化与信息工程学院, 西安 710048;

Abstract: The authors first introduce three typical word posterior probability (WPP) features for translation error detection and classification, namely fixed-position WPP, sliding-window WPP, and alignment-based WPP, and analyze their impact on detection performance. Each WPP feature is then combined with three linguistic features (word, POS, and syntactic features extracted by the LG parser) in a maximum entropy classifier to predict translation errors. Experimental results on Chinese-to-English NIST datasets show that the choice of WPP feature has a significant effect on the classification error rate (CER), and that combining WPP with linguistic features significantly reduces the CER and improves the prediction capability of the classifier.

Key words: error detection, word posterior probability, linguistic features, maximum entropy classifier
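
The abstract names fixed-position WPP and the classification error rate (CER) without giving formulas. The sketch below is one common way to compute them from an N-best list, not necessarily the authors' exact formulation. It assumes that hypothesis posteriors come from softmax-normalized log scores, that a word's fixed-position WPP is the posterior mass of all hypotheses carrying the same word at the same position, and that CER is the fraction of words whose predicted label differs from the gold label. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def hypothesis_posteriors(log_scores: Sequence[float]) -> List[float]:
    """Turn N-best log scores into posteriors via a softmax (assumed normalization)."""
    top = max(log_scores)
    exps = [math.exp(s - top) for s in log_scores]  # subtract max for stability
    total = sum(exps)
    return [x / total for x in exps]


def fixed_position_wpp(
    nbest: Sequence[Tuple[List[str], float]], hyp: List[str]
) -> List[float]:
    """For each word in hyp, sum the posteriors of the N-best entries that
    contain the same word at the same position."""
    posts = hypothesis_posteriors([score for _, score in nbest])
    wpp = []
    for i, word in enumerate(hyp):
        mass = sum(
            p
            for (tokens, _), p in zip(nbest, posts)
            if i < len(tokens) and tokens[i] == word
        )
        wpp.append(mass)
    return wpp


def classification_error_rate(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of words whose predicted label differs from the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("label sequences must be non-empty and of equal length")
    wrong = sum(p != g for p, g in zip(predicted, gold))
    return wrong / len(gold)


# Toy N-best list: (tokens, log score). The values are illustrative only.
toy_nbest = [
    (["the", "cat", "sat"], math.log(0.5)),
    (["the", "dog", "sat"], math.log(0.3)),
    (["a", "cat", "sat", "down"], math.log(0.2)),
]
toy_wpp = fixed_position_wpp(toy_nbest, toy_nbest[0][0])
```

In a setup like the one the abstract describes, each per-word WPP value would become one feature, alongside the word, POS, and syntactic features, in the input to the maximum entropy classifier. That classifier labels each word as correct or incorrect.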

摘要: 首先介绍3种典型的用于翻译错误检测和分类的单词后验概率特征, 即基于固定位置的词后验概率、基于滑动窗的词后验概率和基于词对齐的词后验概率, 分析其对错误检测性能的影响; 然后, 将其分别与语言学特征如词性、词及由LG句法分析器抽取的句法特征等进行组合, 利用最大熵分类器预测翻译错误, 并在汉英NIST数据集上进行实验验证和比较。实验结果表明, 不同的单词后验概率对分类错误率的影响是显著的, 并且在词后验概率基础上加入语言学特征的组合特征可以显著降低分类错误率, 提高译文错误预测性能。

关键词: 错误检测, 词后验概率, 语言学特征, 最大熵分类器
