Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2024, Vol. 60 ›› Issue (3): 393-402. DOI: 10.13209/j.0479-8023.2024.034



Multimodal Emotion Recognition Based on Hierarchical Fusion Strategy and Contextual Information Embedding

SUN Minglong, OUYANG Chunping, LIU Yongbin, REN Lin   

  1. School of Computing, University of South China, Hengyang 421200
  • Received: 2023-05-19  Revised: 2023-07-30  Online: 2024-05-20  Published: 2024-05-20
  • Contact: OUYANG Chunping, E-mail: ouyangcp(at)126.com
  • Funding: Supported by the Natural Science Foundation of Hunan Province (2022JJ30495) and the Key Scientific Research Project of Hunan Provincial Department of Education (22A0316)


Abstract:

Most existing multimodal fusion strategies simply concatenate the features of different modalities, disregarding the personalized fusion requirements arising from the inherent characteristics of each modality. In addition, judging the emotion of an individual utterance in isolation, without accounting for its emotional state within the surrounding conversational context, can lead to recognition errors. To address these issues, this paper proposes a multimodal emotion recognition method based on a hierarchical fusion strategy and contextual information embedding. The hierarchical fusion strategy integrates the features of different modalities progressively, layer by layer, to reduce noise interference from individual modalities and to resolve inconsistencies in expression across modalities. The method also fully exploits the contextual information of the fused modalities, comprehensively considering the emotional representation of each utterance within its context to improve recognition performance. On the binary emotion classification task, the proposed method improves accuracy by 1.54% over the state-of-the-art (SOTA) model; on the multi-class emotion recognition task, it improves the F1 score by 2.79% over the SOTA model.
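
The abstract describes the method only at a high level, so the listing below is a minimal, hypothetical Python (PyTorch) sketch of the two ideas it mentions: progressive, pairwise fusion of per-utterance modality features instead of a single concatenation, followed by a context encoder over the fused utterance sequence. The modality order (text and audio first, then video), the gated fusion unit, the bidirectional GRU context encoder, and all dimensions are illustrative assumptions, not the authors' actual architecture.

# Minimal, hypothetical sketch of (1) hierarchical (progressive) fusion of
# modality features and (2) a context encoder over fused utterance
# representations. All module choices and sizes are assumptions for
# illustration only, not the architecture reported in the paper.
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Fuse two feature vectors with a learned gate instead of plain concatenation."""

    def __init__(self, dim_a: int, dim_b: int, dim_out: int):
        super().__init__()
        self.proj = nn.Linear(dim_a + dim_b, dim_out)
        self.gate = nn.Linear(dim_a + dim_b, dim_out)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        x = torch.cat([a, b], dim=-1)
        return torch.sigmoid(self.gate(x)) * torch.tanh(self.proj(x))


class HierarchicalContextModel(nn.Module):
    """Fuse text+audio first, then add video; encode dialogue context with a GRU."""

    def __init__(self, d_text=768, d_audio=128, d_video=256, d_fused=256, n_classes=6):
        super().__init__()
        self.fuse_ta = GatedFusion(d_text, d_audio, d_fused)    # stage 1: text + audio
        self.fuse_tav = GatedFusion(d_fused, d_video, d_fused)  # stage 2: add video
        self.context = nn.GRU(d_fused, d_fused, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * d_fused, n_classes)

    def forward(self, text, audio, video):
        # text/audio/video: (batch, num_utterances, feature_dim) per-utterance features
        h = self.fuse_ta(text, audio)          # progressive fusion, one modality at a time
        h = self.fuse_tav(h, video)
        ctx, _ = self.context(h)               # each utterance sees its dialogue context
        return self.classifier(ctx)            # (batch, num_utterances, n_classes)


if __name__ == "__main__":
    model = HierarchicalContextModel()
    logits = model(torch.randn(2, 10, 768), torch.randn(2, 10, 128), torch.randn(2, 10, 256))
    print(logits.shape)  # torch.Size([2, 10, 6])

Fusing one modality at a time lets each stage learn its own gating for the incoming modality, which is one plausible way to realize the "personalized fusion" and noise suppression described above; the GRU over the fused sequence corresponds to the contextual embedding step.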

Key words: hierarchical fusion, noise interference, contextual information embedding