Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2024, Vol. 60 ›› Issue (1): 23-33. DOI: 10.13209/j.0479-8023.2023.072


一种消减多模态偏见的鲁棒视觉问答方法

张丰硕, 李豫, 李向前, 徐金安, 陈钰枫   

  1. 北京交通大学计算机与信息技术学院, 北京 100044
  • 收稿日期:2023-05-18 修回日期:2023-09-26 出版日期:2024-01-20 发布日期:2024-01-20
  • 通讯作者: 李向前, E-mail: xqli(at)bjtu.edu.cn

Reducing Multi-modal Biases for Robust Visual Question Answering

ZHANG Fengshuo, LI Yu, LI Xiangqian, XU Jin’an, CHEN Yufeng   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  • Received: 2023-05-18 Revised: 2023-09-26 Online: 2024-01-20 Published: 2024-01-20
  • Contact: LI Xiangqian, E-mail: xqli(at)bjtu.edu.cn

摘要:

为了增强视觉问答模型的鲁棒性, 提出一种偏见消减方法, 并在此基础上探究语言与视觉信息对偏见的影响。进一步地, 构造两个偏见学习分支来分别捕获语言偏见以及语言和图片共同导致的偏见, 利用偏见消减方法, 得到鲁棒性更强的预测结果。最后, 依据标准视觉问答与偏见分支之间的预测概率差异, 对样本进行动态赋权, 使模型针对不同偏见程度的样本动态地调节学习程度。在VQA-CP v2.0等数据集上的实验结果证明了所提方法的有效性, 缓解了偏见对模型的影响。

关键词: 视觉问答, 数据集偏差, 语言偏见, 深度学习

Abstract:

To enhance the robustness of visual question answering (VQA) models, a bias reduction method is proposed, and on this basis the influence of language and visual information on bias is explored. Two bias learning branches are then constructed: one captures language bias, and the other captures the bias caused jointly by language and images. Using the bias reduction method, more robust predictions are obtained. Finally, samples are dynamically weighted according to the difference in prediction probabilities between the standard VQA branch and the bias branches, so that the model adjusts how much it learns from each sample according to that sample's degree of bias. Experiments on VQA-CP v2.0 and other datasets demonstrate that the proposed method is effective and alleviates the influence of bias on the model.

Key words: visual question answering, dataset bias, language bias, deep learning
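
The dynamic sample weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting formula, so everything below is an assumption made for illustration: the function names (`softmax`, `dynamic_weight`, `weighted_nll`), the rule `1 + (p_vqa - p_bias)`, and the use of a single bias branch are hypothetical and are not the authors' implementation. Under this assumed rule, a sample whose ground-truth answer the bias branch predicts more confidently than the standard VQA branch receives a weight below 1, and the opposite case receives a weight above 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight(vqa_logits, bias_logits, answer_idx):
    """Hypothetical weighting rule (an assumption, not the paper's formula).

    Compares the probability that each branch assigns to the ground-truth
    answer. The result lies in [0, 2] because both probabilities lie in [0, 1].
    """
    p_vqa = softmax(vqa_logits)[answer_idx]
    p_bias = softmax(bias_logits)[answer_idx]
    return 1.0 + (p_vqa - p_bias)


def weighted_nll(vqa_logits, bias_logits, answer_idx):
    """Negative log-likelihood of the standard branch, scaled by the weight."""
    weight = dynamic_weight(vqa_logits, bias_logits, answer_idx)
    p_vqa = softmax(vqa_logits)[answer_idx]
    return -weight * math.log(p_vqa)


if __name__ == "__main__":
    # Toy scores over two candidate answers; the ground truth is index 0.
    vqa_scores = [0.0, 0.0]   # standard branch is undecided
    bias_scores = [5.0, 0.0]  # bias branch is confident in the ground truth
    print(dynamic_weight(vqa_scores, bias_scores, 0))
    print(weighted_nll(vqa_scores, bias_scores, 0))
```

With two bias branches, as in the abstract, the same comparison could be made against each branch separately; how the two results would be combined is not stated in the abstract and is left open here.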