Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2023, Vol. 59 ›› Issue (1): 48-56. DOI: 10.13209/j.0479-8023.2022.074


Difference between Multi-modal and Text Pre-trained Models in Embedding Text

SUN Yuchong1, CHENG Xiwei2, SONG Ruihua1,3,†, CHE Wanxiang4, LU Zhiwu1,3, WEN Jirong1,3   

  1. Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872; 2. School of Statistics, Renmin University of China, Beijing 100872; 3. Beijing Academy of Artificial Intelligence, Beijing 100084; 4. Faculty of Computing, Harbin Institute of Technology, Harbin 150001
  • Received: 2022-05-13  Revised: 2022-08-18  Online: 2023-01-20  Published: 2023-01-20
  • Contact: SONG Ruihua (corresponding author), E-mail: rsong(at)ruc.edu.cn

  • Funding: Supported by the Beijing Outstanding Young Scientist Program (BJJWZYJH012019100020098)

Abstract:

This paper provides a quantitative comparison between the text embeddings of a text pre-trained model (i.e., RoBERTa) and a multi-modal pre-trained model (i.e., WenLan). Two quantitative comparison methods are proposed: in each embedding space, the semantics of a word is represented by the set of its k nearest words, and the semantic change of the word between the two spaces is measured by the Jaccard similarity of the two sets; in addition, each word is paired with its k nearest words, and the relationships within these word pairs are analyzed. The results show that multi-modal pre-training brings more semantic changes to more abstract words (e.g., success and love), and that the multi-modal pre-trained model better differentiates antonyms and discovers more hypernyms or hyponyms, while the text pre-trained model works better at finding synonyms. Moreover, the multi-modal pre-trained model constructs more extensive associative relationships between words.
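
To make the first comparison method concrete, below is a minimal Python sketch, not the authors' implementation: it assumes two word-embedding matrices over a shared vocabulary (the variables roberta_emb, wenlan_emb, and vocab are hypothetical), finds each word's k nearest neighbors by cosine similarity in both spaces, and computes the Jaccard similarity of the two neighbor sets; a low score indicates a large semantic change for that word.

# Minimal sketch (not the paper's code) of the k-nearest-neighbor / Jaccard comparison.
# Assumption: each model's embeddings are a NumPy array of shape (vocab_size, dim)
# over the same word list `vocab`.
import numpy as np

def knn_set(emb, vocab, word, k=10):
    """Return the set of the k words closest to `word` by cosine similarity."""
    idx = vocab.index(word)
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = normed @ normed[idx]          # cosine similarity to every word
    sims[idx] = -np.inf                  # exclude the word itself
    return {vocab[i] for i in np.argsort(-sims)[:k]}

def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| between two neighbor sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical usage: how much does the neighborhood of "love" differ between
# the text-only space (e.g., RoBERTa) and the multi-modal space (e.g., WenLan)?
# score = jaccard(knn_set(roberta_emb, vocab, "love"), knn_set(wenlan_emb, vocab, "love"))

Under this sketch, words with the smallest Jaccard scores are those whose neighborhoods shift most between the two spaces, which is one way the kind of semantic change described above could be quantified.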

Key words: multi-modal pre-training, text representation, text embedding analysis
