Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2016, Vol. 52 ›› Issue (1): 25-34. DOI: 10.13209/j.0479-8023.2016.022


Research on the Construction of Bilingual Movie Knowledge Graph

WANG Weiwei, WANG Zhigang, PAN Liangming, LIU Yang, ZHANG Jiangtao   

  1. Knowledge Engineering Group, Department of Computer Science and Technology, Tsinghua University, Beijing 100084
  • Received: 2015-06-06 Online: 2016-01-20 Published: 2016-01-20
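  • Contact: WANG Zhigang, E-mail: wangzigo(at)gmail.com
  • Supported by: the National Basic Research Program of China (2014CB340504), the bilateral cooperation agreement between the National Natural Science Foundation of China and the French National Research Agency (61261130588), the Tsinghua University Initiative Scientific Research Program (20131089256), the National Key Technology R&D Program (2014BAK04B00), and the THU-NUS Joint Research Center for Next-Generation Search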

Abstract:

This paper proposes a method for constructing a Bilingual Movie Knowledge Graph (BMKG). The authors first build a Bilingual Movie Ontology (BMO) semi-automatically and align each movie data source with it, so that the semantic descriptions of heterogeneous data sources stay consistent. For entity linking, the method exploits domain-specific features and computes entity similarity with both Word2Vec and TF-IDF vector models, which doubles the number of similarity features and substantially improves linking performance. For entity matching, an algorithm based on similarity flooding is proposed; it uses the intrinsic links between movie data sources to overcome the language barrier in computing similarity between cross-lingual entities. Experimental results show that entity matching precision reaches about 90% when the threshold is 0.75 or above. In addition, a movie knowledge graph sharing platform is built to provide open data access and a query interface.

Key words: movie ontology, bilingual, knowledge graph
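
The entity matching idea in the abstract, propagating similarity along links between data sources and then accepting pairs above a threshold, can be illustrated with a small sketch. This is not the authors' algorithm: the function names, the seed scores, the unit propagation weights, and the toy entities below are all assumptions made for illustration. It only shows how a cross-lingual movie pair with no usable string similarity can gain support from an already linked neighbor such as a director.

```python
def flood_similarity(seeds, neighbors_a, neighbors_b, rounds=10):
    """Simplified similarity-flooding-style propagation (illustrative only).

    seeds:       dict mapping candidate pairs (a, b) to an initial similarity.
    neighbors_a: dict mapping an entity of source A to its linked entities.
    neighbors_b: same for source B.
    Each round, a pair's score is its seed plus the current scores of its
    neighbor pairs; scores are then divided by the round's maximum.
    """
    sim = dict(seeds)
    for _ in range(rounds):
        raw = {}
        for (a, b), seed in seeds.items():
            # Support flowing in from pairs formed by the neighbors of a and b.
            support = sum(
                sim.get((na, nb), 0.0)
                for na in neighbors_a.get(a, ())
                for nb in neighbors_b.get(b, ())
            )
            raw[(a, b)] = seed + support
        top = max(raw.values(), default=0.0)
        sim = {pair: (value / top if top > 0 else 0.0) for pair, value in raw.items()}
    return sim


def select_matches(sim, threshold=0.75):
    """Keep candidate pairs whose propagated similarity reaches the threshold."""
    return {pair for pair, score in sim.items() if score >= threshold}


# Hypothetical toy data: one Chinese-source movie, two English-source candidates.
neighbors_zh = {"zh:盗梦空间": ["zh:诺兰"], "zh:诺兰": ["zh:盗梦空间"]}
neighbors_en = {
    "en:Inception": ["en:Christopher Nolan"],
    "en:Titanic": ["en:James Cameron"],
    "en:Christopher Nolan": ["en:Inception"],
    "en:James Cameron": ["en:Titanic"],
}
seeds = {
    ("zh:诺兰", "en:Christopher Nolan"): 1.0,  # assumed to be already linked
    ("zh:诺兰", "en:James Cameron"): 0.0,
    ("zh:盗梦空间", "en:Inception"): 0.5,      # uninformative cross-lingual prior
    ("zh:盗梦空间", "en:Titanic"): 0.5,
}

scores = flood_similarity(seeds, neighbors_zh, neighbors_en)
matches = select_matches(scores, threshold=0.75)
```

Both movie candidates start with the same seed, so only the propagated support from the director pair separates them. The 0.75 cut-off mirrors the threshold mentioned in the abstract, but its effect on this toy graph says nothing about the reported precision.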
