Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2021, Vol. 57 ›› Issue (1): 75-82.DOI: 10.13209/j.0479-8023.2020.080


Image Caption Method Fusing an Object Spatial Relation Mechanism

WAN Zhang, ZHANG Yujie, LIU Mingtong, XU Jin’an, CHEN Yufeng   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  • Received:2020-06-09 Revised:2020-08-14 Online:2021-01-20 Published:2021-01-20
  • Contact: ZHANG Yujie, E-mail: yjzhang(at)


  • Supported by the National Natural Science Foundation of China (61876198, 61976015, 61976016)


Focusing on the positional relationships between objects in an image, we propose a neural image caption generation model that incorporates a spatial relation mechanism, aiming to supply key information (object positions or trajectories) to downstream tasks such as visual question answering and voice navigation. To strengthen the image encoder's ability to learn inter-object positional relationships, we modify the Transformer structure to introduce a geometric attention mechanism that explicitly fuses the positional relationships between objects into their appearance features. To support extraction and caption generation oriented toward this specific information, we further propose a method for constructing relative-position data, and build Re-Position, an image caption dataset of inter-object positional relationships, on top of the SpatialSense dataset. Comparative evaluation against five representative models shows that the proposed model outperforms the other models on five metrics on the public COCO test set, and on all six metrics on the Re-Position dataset.
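The geometric attention described above fuses pairwise bounding-box geometry into the Transformer's attention weights. A minimal NumPy sketch under common assumptions (log-scaled relative box features and a ReLU-gated geometric bias, in the style of relation-network geometric attention; the function names and the projection vector `w_g` are illustrative, not the authors' implementation):

```python
import numpy as np

def relative_geometry(boxes):
    """Pairwise relative-position features between object bounding boxes.

    boxes: (N, 4) array of (x, y, w, h) box centers and sizes.
    Returns an (N, N, 4) array of log-scaled offsets and size ratios,
    which makes the features invariant to overall image scale.
    """
    x, y, w, h = boxes.T
    dx = np.log(np.abs(x[None, :] - x[:, None]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(y[None, :] - y[:, None]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometric_attention(Q, K, V, boxes, w_g):
    """Scaled dot-product attention with an additive geometric bias.

    Appearance scores come from Q.K as in a standard Transformer layer;
    a geometric score, projected from the relative box features by w_g,
    is folded in before the softmax, so spatial relations modulate how
    strongly each object attends to every other object.
    """
    d = Q.shape[-1]
    appearance = Q @ K.T / np.sqrt(d)                       # (N, N)
    geo = np.maximum(relative_geometry(boxes) @ w_g, 0.0)   # ReLU-gated bias
    scores = appearance + np.log(geo + 1e-6)                # fuse geometry in
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V
```

In a full model, `w_g` would be a learned projection (one per attention head), and the output above would feed the usual feed-forward sublayer of the Transformer encoder.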

Key words: image caption, positional relationship between objects, attention mechanism, Transformer structure


