Research on the Construction and Application of Paraphrase Parallel Corpus

doi:10.13209/j.0479-8023.2020.078

Acta Scientiarum Naturalium Universitatis Pekinensis, 2021, Vol. 57, Issue (1): 68-74.

WANG Yasong, LIU Mingtong, ZHANG Yujie^†, XU Jin’an, CHEN Yufeng

School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044

Received: 2020-06-07 Revised: 2020-08-15 Online: 2021-01-20 Published: 2021-01-20
Contact: ZHANG Yujie, E-mail: yjzhang(at)bjtu.edu.cn
Funding: Supported by the National Natural Science Foundation of China (61876198, 61976015, 61976016)

Abstract:

Taking Chinese as the research object, the authors propose a method for constructing large-scale, high-quality Chinese paraphrase parallel corpora. Paraphrase data are augmented with translation engines, which transfer an English paraphrase parallel corpus into Chinese, and a Chinese paraphrase evaluation data set is annotated manually. Using the constructed Chinese paraphrase data, the effectiveness of the corpus construction method and of its application method is verified on the paraphrase recognition and natural language inference tasks. First, a paraphrase recognition data set is generated from the constructed paraphrase corpus, and an attention-based neural sentence matching model is pre-trained on it to capture paraphrase information. Then, the pre-trained model is applied to the natural language inference task to improve its performance. Experimental results on a public natural language inference data set show that the constructed paraphrase corpus can be applied effectively to the paraphrase recognition task and that the model can learn paraphrase knowledge. When applied to the natural language inference task, paraphrase knowledge effectively improves the accuracy of natural language inference models, which verifies its effectiveness for downstream semantic understanding tasks. The proposed construction method is language-independent, so it can generate high-quality paraphrase data as additional training data for other languages and domains and thereby improve the performance of other tasks.

Key words: paraphrase corpus construction, data augmentation, transfer learning, paraphrase recognition, natural language inference
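
The pipeline outlined in the abstract (transfer an English paraphrase corpus through a translation engine, then derive a paraphrase recognition data set from the resulting pairs) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `translate` is a hypothetical stand-in for a translation engine call, the filter that drops pairs whose two translations coincide is assumed, and the negative-sampling scheme (pairing a sentence with the second side of a different pair) is one common choice that the abstract does not specify.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
Example = Tuple[str, str, int]  # (sentence_a, sentence_b, label)


def transfer_pairs(pairs: List[Pair], translate: Callable[[str], str]) -> List[Pair]:
    """Translate both sides of each source-language paraphrase pair.

    `translate` is a hypothetical placeholder for a translation engine.
    Pairs whose translations are identical are dropped (assumed filter),
    because they carry no paraphrase variation.
    """
    out = []
    for a, b in pairs:
        ta, tb = translate(a), translate(b)
        if ta != tb:
            out.append((ta, tb))
    return out


def build_recognition_set(pairs: List[Pair], seed: int = 0) -> List[Example]:
    """Label aligned pairs 1 and randomly mismatched pairs 0 (assumed scheme)."""
    rng = random.Random(seed)
    examples = [(a, b, 1) for a, b in pairs]
    if len(pairs) > 1:
        for i, (a, _) in enumerate(pairs):
            # Pick the second side of a different pair as a non-paraphrase.
            j = rng.choice([k for k in range(len(pairs)) if k != i])
            examples.append((a, pairs[j][1], 0))
    rng.shuffle(examples)
    return examples


# Toy usage: a dictionary lookup plays the role of the translation engine.
toy = {
    "hello": "你好", "hi": "嗨",
    "thanks": "谢谢", "thank you": "谢谢你",
    "bye": "再见", "goodbye": "再见",
}
en_pairs = [("hello", "hi"), ("thanks", "thank you"), ("bye", "goodbye")]
zh_pairs = transfer_pairs(en_pairs, lambda s: toy[s])
examples = build_recognition_set(zh_pairs)
```

A labeled set of this shape could then serve as pre-training data for a sentence matching model before it is fine-tuned on natural language inference, as the abstract describes. Randomly mismatched negatives can occasionally be true paraphrases, so a real corpus would likely need an additional check.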

WANG Yasong, LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng. Research on the Construction and Application of Paraphrase Parallel Corpus[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2021, 57(1): 68-74.

Link to this article: https://xbna.pku.edu.cn/CN/10.13209/j.0479-8023.2020.078

https://xbna.pku.edu.cn/CN/Y2021/V57/I1/68
