Acta Scientiarum Naturalium Universitatis Pekinensis


Research on Entity Linking of Chinese Microblog

ZHU Min, JIA Zhen, ZUO Ling, WU Anjun, CHEN Fangzheng, BAI Yu   

  1. School of Information Science and Technology, Southwest Jiaotong University, Chengdu 610031
  • Received: 2013-07-05  Online: 2014-01-20  Published: 2014-01-20



  1. 西南交通大学信息科学技术学院, 成都 610031;

Abstract: The authors address the Chinese microblog entity linking task of NLP&CC2013, using Sina microblog data provided by CCF as training and test data and the Yebol Chinese word segmentation system of Southwest Jiaotong University as the preprocessing tool. An entity linking method is proposed that links mentions to knowledge base entries by searching for candidate entities in a thesaurus, using an improved pinyin edit distance and a suffix vocabulary matching method. The authors also propose a disambiguation method that combines entity clustering disambiguation with disambiguation of similar entities based on Baidu Baike term frequencies. In the CCF Chinese microblog entity linking evaluation, the system achieved an accuracy of 0.8838, ranking third among ten participating systems. The results indicate that the proposed entity linking and disambiguation methods are effective and robust to noise in text.

Key words: microblog entity linking, improved pinyin edit distance, suffix vocabulary matching method, entity disambiguation
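
The abstract does not spell out how the improved pinyin edit distance is computed. As a rough illustration of the underlying idea only, the sketch below applies the standard Levenshtein distance to pinyin syllable sequences instead of characters, so that a mention written with homophonous characters still matches its knowledge base entry. The syllable table `TOY_PINYIN` and the function names are hypothetical and are not taken from the paper; the authors' improvements are not reproduced here.

```python
# Hypothetical toy syllable table covering only the characters used in
# the examples; a real system would need a full pronunciation dictionary.
TOY_PINYIN = {
    "李": "li", "里": "li",
    "娜": "na", "那": "na",
    "宁": "ning",
}


def to_pinyin(text):
    """Map each character to a toneless pinyin syllable.

    Characters missing from the toy table are kept unchanged.
    """
    return [TOY_PINYIN.get(ch, ch) for ch in text]


def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def pinyin_edit_distance(mention, candidate):
    """Edit distance over pinyin syllables rather than characters."""
    return edit_distance(to_pinyin(mention), to_pinyin(candidate))
```

In this sketch, `pinyin_edit_distance("李娜", "里那")` is 0 because both strings read "li na", whereas the character-level `edit_distance("李娜", "里那")` is 2. Working at the syllable level is one plausible way to tolerate homophone noise in microblog text; whether the paper operates on syllables, letters, or tones is not stated in the abstract.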

摘要: 针对2013年CCF自然语言处理与中文计算会议(NLP&CC2013)中文微博实体链接的任务, 使用CCF提供的新浪微博数据作为训练和测试数据, 利用西南交通大学耶宝智慧中文分词平台作为自然语言预处理工具, 提出一种实体链接的方法。该方法应用改进的拼音编辑距离算法和后缀词表匹配法, 提出实体聚类消歧与基于百度百科词频的同类实体消歧相结合的消歧方法。在2013年CCF 中文微博实体链接评测任务中正确结果的准确率为0.8838, 在10 个参赛队伍中名列第3位。表明该方法有效并可以适应文本中的噪声。

关键词: 微博实体链接, 改进的拼音编辑距离, 后缀词表匹配法, 实体消歧
