Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2022, Vol. 58 ›› Issue (1): 45-53. DOI: 10.13209/j.0479-8023.2021.110


Multi-modality Paraphrase Generation Model Integrating Image Information

MA Chao, WAN Zhang, ZHANG Yujie, XU Jin’an, CHEN Yufeng   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  • Received: 2021-06-09 Revised: 2021-08-17 Online: 2022-01-20 Published: 2022-01-20
  • Contact: ZHANG Yujie, E-mail: yjzhang(at)


  • Supported by:
    National Natural Science Foundation of China (61876198, 61976015, 61976016)


In multi-modality scenarios such as commodity descriptions and news comments, existing paraphrase generation models cannot use image information, so the paraphrases they generate lose semantics. To solve this problem, this paper proposes the Multi-modality Paraphrase Generation (MPG) model, which integrates image information into paraphrase generation. To integrate the image information corresponding to the original sentence, MPG first constructs an abstract scene graph and transforms the image features into node features of the scene graph. The constructed scene graph is then used to generate the paraphrase, with a relational graph convolutional neural network as the encoder and a graph-based attention mechanism in the decoder. For evaluation, a sentence-pair similarity calculation method is proposed to select sentence pairs describing the same objects from the MSCOCO data set, and evaluation experiments are conducted on the selected pairs. Experimental results show that the proposed MPG model achieves better semantic fidelity, which indicates that integrating image information effectively improves the quality of paraphrase generation in multi-modality scenarios.

Key words: paraphrase generation, multi-modality, abstract scene graph, attention mechanism


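In multi-modality scenarios such as commodity descriptions and news comments, existing paraphrase generation models can generate paraphrases only from textual information. To address the semantic loss caused by their inability to use image information, a multi-modality paraphrase generation model (MPG) is proposed that introduces image information and uses it to generate paraphrases. In MPG, to introduce the image information corresponding to the original sentence, an abstract scene graph is first constructed from the original sentence, and the features of the image regions associated with the original sentence are converted into node features of the scene graph. Then, to use the constructed scene graph to generate semantically consistent paraphrases, a relational graph convolutional neural network and a graph-based attention mechanism are used to encode and decode the graph node features. In the evaluation stage, a sentence-pair similarity calculation method is proposed to select, from the MSCOCO data set, sentence pairs that describe the same objects in an image, and these pairs are used as the paraphrase test set. Experimental results show that the paraphrases generated by the proposed MPG model have better semantic fidelity, which indicates that introducing image information is effective for improving paraphrase quality in multi-modality scenarios.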
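
The abstract names a relational graph convolutional network as the encoder over the scene graph but gives no formulas. As a rough illustration only, the sketch below shows one step of the generic relational graph convolution update (a self-loop term plus relation-specific messages averaged per relation, followed by ReLU) on a toy three-node graph. The node names, edge labels ("subject", "object"), feature values, and weight matrices are all hypothetical, and the function `rgcn_layer` is not taken from the paper; the authors' actual encoder may differ.

```python
from collections import defaultdict


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def rgcn_layer(node_feats, edges, rel_weights, self_weight):
    """One generic relational graph convolution step (illustrative only).

    node_feats:  {node: feature vector}
    edges:       list of (source, relation, target) triples
    rel_weights: {relation: weight matrix}
    self_weight: weight matrix applied to each node's own features
    """
    # Group the incoming neighbours of every node by relation type.
    incoming = defaultdict(lambda: defaultdict(list))
    for src, rel, dst in edges:
        incoming[dst][rel].append(src)

    updated = {}
    for node, feat in node_feats.items():
        total = matvec(self_weight, feat)  # self-loop term
        for rel, sources in incoming[node].items():
            for src in sources:
                msg = matvec(rel_weights[rel], node_feats[src])
                # Average the messages that arrive under the same relation.
                total = [t + m / len(sources) for t, m in zip(total, msg)]
        updated[node] = [max(0.0, t) for t in total]  # ReLU
    return updated


# Hypothetical toy scene graph for a sentence like "a dog on a sofa".
# The 2-d vectors stand in for image-region features attached to nodes.
feats = {"dog": [1.0, 0.0], "sofa": [0.0, 1.0], "on": [0.0, 0.0]}
edges = [("dog", "subject", "on"), ("on", "object", "sofa")]
identity = [[1, 0], [0, 1]]
weights = {"subject": [[0.5, 0], [0, 0.5]], "object": identity}

out = rgcn_layer(feats, edges, weights, identity)
print(out)
```

In this toy run the relationship node "on" starts with zero features and picks up half of the "dog" features through the "subject" edge, which is the kind of typed message passing a relational encoder relies on. All nodes are updated from the original features, so the result does not depend on iteration order.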
