Acta Scientiarum Naturalium Universitatis Pekinensis, 2020, Vol. 56, Issue (1): 75-81. DOI: 10.13209/j.0479-8023.2019.105


Multimodal Emotion Recognition with Auxiliary Sentiment Information

WU Liangqing, LIU Qiyuan, ZHANG Dong, WANG Jiancheng, LI Shoushan, ZHOU Guodong

  1. School of Computer Science & Technology, Soochow University, Suzhou 215006
  • Received: 2019-05-22  Revised: 2019-09-19  Online: 2020-01-20  Published: 2020-01-20
  • Corresponding author: ZHANG Dong, E-mail: dzhang17(at)stu.suda.edu.cn
  • Funding:
    Supported by the National Natural Science Foundation of China (61331011, 61375073)

Multimodal Emotion Recognition with Auxiliary Sentiment Information

WU Liangqing, LIU Qiyuan, ZHANG Dong, WANG Jiancheng, LI Shoushan, ZHOU Guodong   

  1. School of Computer Science & Technology, Soochow University, Suzhou 215006
  • Received: 2019-05-22  Revised: 2019-09-19  Online: 2020-01-20  Published: 2020-01-20
  • Contact: ZHANG Dong, E-mail: dzhang17(at)stu.suda.edu.cn

Abstract:

Unlike emotion analysis on text alone, this paper studies emotion recognition on multimodal data (text and audio). To account for the characteristics of both modalities at the same time, a novel joint learning framework is proposed that treats multimodal emotion classification as the main task and multimodal sentiment classification as an auxiliary task, using sentiment information to boost the performance of emotion recognition. First, private network layers encode the text and audio modalities of the main task separately, learning independent emotion feature representations within each modality. Next, the shared network layers of the auxiliary task produce the auxiliary emotion representations for the main task as well as the complete uni-modal sentiment representations for the auxiliary task. The auxiliary text and audio representations are then combined with the corresponding independent feature representations of the main task to obtain the complete uni-modal emotion representations of the main task. Finally, a self-attention mechanism captures the multimodal interactions of each task, yielding the final multimodal emotion and sentiment representations. Experimental results show that, on a multimodal sentiment analysis dataset, the auxiliary sentiment information substantially improves the performance of the emotion classification task, while the performance of the sentiment classification task also improves to some extent.
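
For intuition, the architecture described above can be sketched in code. The following is a minimal PyTorch-style illustration, not the authors' implementation: the choice of GRU encoders, all dimensions, the use of concatenation to combine private and auxiliary representations, and torch.nn.MultiheadAttention for the self-attention fusion are assumptions made only for this sketch.

    import torch
    import torch.nn as nn

    class JointEmotionSentimentModel(nn.Module):
        """Private encoders for the main (emotion) task, shared encoders for the
        auxiliary (sentiment) task, and self-attention fusion over modalities."""

        def __init__(self, text_dim=300, audio_dim=74, hidden=128,
                     n_emotions=6, n_sentiments=3):
            super().__init__()
            # Private layers: independent uni-modal emotion dynamics (main task).
            self.private_text = nn.GRU(text_dim, hidden, batch_first=True)
            self.private_audio = nn.GRU(audio_dim, hidden, batch_first=True)
            # Shared layers: uni-modal sentiment representations (auxiliary task),
            # which also serve as auxiliary representations for the main task.
            self.shared_text = nn.GRU(text_dim, hidden, batch_first=True)
            self.shared_audio = nn.GRU(audio_dim, hidden, batch_first=True)
            # Self-attention across the two modalities of each task.
            self.fuse_emotion = nn.MultiheadAttention(2 * hidden, num_heads=2, batch_first=True)
            self.fuse_sentiment = nn.MultiheadAttention(hidden, num_heads=2, batch_first=True)
            self.emotion_head = nn.Linear(2 * hidden, n_emotions)
            self.sentiment_head = nn.Linear(hidden, n_sentiments)

        @staticmethod
        def _last(gru_output):
            # Summarize a sequence by its final time step.
            return gru_output[:, -1, :]

        def forward(self, text, audio):
            # text: (batch, text_len, text_dim); audio: (batch, audio_len, audio_dim)
            p_t = self._last(self.private_text(text)[0])    # private text dynamics
            p_a = self._last(self.private_audio(audio)[0])  # private audio dynamics
            s_t = self._last(self.shared_text(text)[0])     # shared / auxiliary text
            s_a = self._last(self.shared_audio(audio)[0])   # shared / auxiliary audio
            # Complete uni-modal emotion representations: private + auxiliary.
            e_t = torch.cat([p_t, s_t], dim=-1)
            e_a = torch.cat([p_a, s_a], dim=-1)
            # Self-attention over the two modalities of each task, then pool.
            emo_seq = torch.stack([e_t, e_a], dim=1)        # (batch, 2, 2*hidden)
            sen_seq = torch.stack([s_t, s_a], dim=1)        # (batch, 2, hidden)
            emo = self.fuse_emotion(emo_seq, emo_seq, emo_seq)[0].mean(dim=1)
            sen = self.fuse_sentiment(sen_seq, sen_seq, sen_seq)[0].mean(dim=1)
            return self.emotion_head(emo), self.sentiment_head(sen)

    # Example forward pass with random features (shapes are illustrative only).
    model = JointEmotionSentimentModel()
    text = torch.randn(8, 20, 300)    # e.g. a batch of word-embedding sequences
    audio = torch.randn(8, 50, 74)    # e.g. frame-level acoustic features
    emotion_logits, sentiment_logits = model(text, audio)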

Key words: multimodal, emotion recognition, joint learning, sentiment analysis

Abstract:

Different from previous studies that use text only, this paper focuses on multimodal data (text and audio) to perform emotion recognition. To simultaneously address the characteristics of multimodal data, we propose a novel joint learning framework, which allows the auxiliary task (multimodal sentiment classification) to help the main task (multimodal emotion classification). Specifically, private neural layers are first designed for the text and audio modalities of the main task to learn the uni-modal independent dynamics. Then, with the shared neural layers of the auxiliary task, we obtain the uni-modal representations of the auxiliary task and the auxiliary representations of the main task. The uni-modal independent dynamics are combined with the auxiliary representations for each modality to acquire the complete uni-modal representations of the main task. Finally, in order to capture the multimodal interactive dynamics, we fuse the text and audio representations of the main and auxiliary tasks separately with a self-attention mechanism to obtain the final multimodal emotion and sentiment representations. Empirical results demonstrate the effectiveness of our approach on the multimodal emotion classification task as well as the sentiment classification task.
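
In the joint learning setup, both outputs are trained together. As a rough sketch only (the auxiliary loss weight below is an assumption, not a value reported by the paper), the two classification losses can be combined into a single training objective:

    import torch.nn.functional as F

    def joint_loss(emotion_logits, sentiment_logits,
                   emotion_labels, sentiment_labels, aux_weight=0.5):
        # Main-task (emotion) loss plus a down-weighted auxiliary (sentiment) loss.
        main = F.cross_entropy(emotion_logits, emotion_labels)
        aux = F.cross_entropy(sentiment_logits, sentiment_labels)
        return main + aux_weight * aux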

Key words: multimodal, emotion recognition, joint learning, sentiment analysis