Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2021, Vol. 57 ›› Issue (1): 68-74. DOI: 10.13209/j.0479-8023.2020.078


Research on the Construction and Application of Paraphrase Parallel Corpus

WANG Yasong, LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  • Received: 2020-06-07  Revised: 2020-08-15  Online: 2021-01-20  Published: 2021-01-20
  • Contact: ZHANG Yujie, E-mail: yjzhang(at)bjtu.edu.cn

  • Supported by the National Natural Science Foundation of China (61876198, 61976015, 61976016)

Abstract:

Taking Chinese as the research object, the authors propose a method for constructing a large-scale, high-quality Chinese paraphrase parallel corpus. Paraphrase data augmentation is performed with translation engines, which transfer an English paraphrase parallel corpus into Chinese, and a Chinese paraphrase evaluation data set is annotated manually. Based on the constructed Chinese paraphrase data, the effectiveness of the corpus construction and application method is verified on the paraphrase recognition task and the natural language inference task. First, a paraphrase recognition data set is generated from the constructed paraphrase corpus, and an attention-based neural network sentence-matching model is pre-trained on it so that the model captures paraphrase information. The pre-trained model is then applied to the natural language inference task to improve its performance. Evaluation results on a public natural language inference data set show that the constructed paraphrase corpus can be effectively applied to the paraphrase recognition task and that the model learns paraphrase knowledge. When the model is applied to the natural language inference task, the paraphrase knowledge effectively improves the accuracy of natural language inference models, verifying the usefulness of paraphrase knowledge for downstream semantic understanding tasks. Moreover, the proposed corpus construction method is language-independent, so it can provide more training data for other languages and domains, generate high-quality paraphrase data, and further improve the performance of other tasks.
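The corpus construction step described above (paraphrase data augmentation by transferring English paraphrase pairs into Chinese with translation engines) can be pictured with a minimal sketch. This is not the authors' pipeline: the translate helper, the tab-separated input format, and the file paths are hypothetical placeholders for whatever translation engine and data layout are actually used.

# Minimal, hypothetical sketch of translation-based paraphrase data augmentation:
# each English paraphrase pair (s1, s2) is machine-translated into Chinese, and
# the two translations are kept as a Chinese paraphrase pair.

import csv

def translate(text: str, src: str = "en", tgt: str = "zh") -> str:
    """Placeholder for a call to any machine translation engine."""
    raise NotImplementedError("plug in a real translation engine here")

def build_chinese_paraphrase_corpus(en_pairs_path: str, out_path: str) -> None:
    # Assumed input format: one English paraphrase pair per line, tab-separated.
    with open(en_pairs_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8", newline="") as fout:
        reader = csv.reader(fin, delimiter="\t")
        writer = csv.writer(fout, delimiter="\t")
        for s1_en, s2_en in reader:
            s1_zh = translate(s1_en)
            s2_zh = translate(s2_en)
            # The translated pair inherits the paraphrase relation of the source pair.
            writer.writerow([s1_zh, s2_zh])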

Key words: paraphrase corpus construction, data augmentation, transfer learning, paraphrase recognition, natural language inference
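The application step (pre-training an attention-based sentence-matching model on the generated paraphrase recognition data and then transferring it to natural language inference) could look roughly like the following PyTorch sketch. The architecture shown is a generic BiLSTM cross-attention matcher, not the exact model of the paper, and all class names and hyperparameters are illustrative assumptions.

# Illustrative sketch (PyTorch): pre-train a cross-attention sentence-matching
# model on paraphrase recognition (2 classes), then reuse its encoder with a
# new 3-way head for natural language inference.

import torch
import torch.nn as nn

class MatchingEncoder(nn.Module):
    """BiLSTM encoder with soft cross-attention between the two sentences."""
    def __init__(self, vocab_size: int, emb_dim: int = 300, hidden: int = 200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out_dim = 8 * hidden  # two mean-pooled [state; aligned state] vectors

    def forward(self, x1, x2):
        a, _ = self.rnn(self.emb(x1))               # (B, L1, 2H)
        b, _ = self.rnn(self.emb(x2))               # (B, L2, 2H)
        att = torch.bmm(a, b.transpose(1, 2))       # (B, L1, L2) alignment scores
        a_hat = torch.bmm(att.softmax(dim=2), b)    # sentence 2 aligned to sentence 1
        b_hat = torch.bmm(att.softmax(dim=1).transpose(1, 2), a)  # and vice versa
        va = torch.cat([a, a_hat], dim=-1).mean(dim=1)   # (B, 4H)
        vb = torch.cat([b, b_hat], dim=-1).mean(dim=1)   # (B, 4H)
        return torch.cat([va, vb], dim=-1)               # (B, 8H)

class SentencePairClassifier(nn.Module):
    def __init__(self, encoder: MatchingEncoder, num_classes: int):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.out_dim, num_classes)

    def forward(self, x1, x2):
        return self.head(self.encoder(x1, x2))

# 1) Pre-train on the generated paraphrase recognition data (binary labels).
encoder = MatchingEncoder(vocab_size=50000)
paraphrase_model = SentencePairClassifier(encoder, num_classes=2)
# ... train paraphrase_model on (sentence1, sentence2, is_paraphrase) pairs ...

# 2) Transfer: keep the pre-trained encoder, attach a fresh 3-way NLI head.
nli_model = SentencePairClassifier(paraphrase_model.encoder, num_classes=3)
# ... fine-tune nli_model on (premise, hypothesis, {entail, neutral, contradict}) ...

The design choice this sketch highlights is the one the abstract emphasizes: the matching encoder learned from paraphrase pairs is reused unchanged, and only the classification head is replaced for the three-way inference labels.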
