北京大学学报(自然科学版)

多策略同义词获取方法研究

宋文杰1,2,顾彦慧1,2,周俊生1,2,孙玉杰1,2,严杰1,曲维光1,2,3   

  1. 南京师范大学计算机科学与技术学院, 南京 210023; 2. 江苏省信息安全保密技术工程研究中心, 南京 210023; 3. 南京大学计算机软件新技术国家重点实验室, 南京 210023
  • 收稿日期:2014-06-30 出版日期:2015-03-20 发布日期:2015-03-20

Multi-strategies Extraction of Chinese Synonyms

SONG Wenjie1,2, GU Yanhui1,2, ZHOU Junsheng1,2, SUN Yujie1,2, YAN Jie1, QU Weiguang1,2,3   

  1. School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023; 2. Jiangsu Research Center of Information Security & Privacy Technology, Nanjing 210023; 3. State Key Lab for Novel Software Technology, Nanjing University, Nanjing 210023
  • Received:2014-06-30 Online:2015-03-20 Published:2015-03-20

摘要: 提出一种多策略同义词获取方法, 一方面利用《同义词词林》、《中文概念词典》等现有语义词典中蕴含的同义关系获取同义词, 另一方面根据百度百科信息框(Bdbk)中特征词和汉典网(Zdic)中HTML标记获取同义词, 同时采用DIPRE自动获取模式的方法, 从百度百科文本中发现置信度较高的模式和同义关系。实验结果表明, 所提方法在NLP&CC 2012同义词评测数据集中取得较好结果。利用该方法, 以《现代汉语语法信息词典》名词部分为目标, 构建一部同义词词典并进行人工校对, 为《现代汉语语法信息词典》构建较为完善的语义关系体系做出尝试。

关键词: 同义词, 关系抽取, 模式匹配, 网络百科

Abstract: A multi-strategy method for extracting Chinese synonyms is proposed. Synonyms are first obtained from the synonymy relations encoded in existing semantic dictionaries such as Cilin and the Chinese Concept Dictionary. Further synonyms are extracted using the feature words of infoboxes in Baidubaike and the HTML tags of web pages in Zdic. In addition, DIPRE (Dual Iterative Pattern Relation Expansion) is applied to discover high-confidence patterns and synonymous instances in Baidubaike text. Experiments show that the proposed method achieves good results on the NLP&CC 2012 synonym evaluation dataset. Using this method, a synonym dictionary was built and manually proofread for the noun part of the Grammatical Knowledge-base of Contemporary Chinese, as a step toward a more complete system of semantic relations for that knowledge base.

Key words: synonym, relation extraction, pattern matching, online encyclopedia
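
The DIPRE-style bootstrapping mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the seed pairs, the `TERM` and `BOUNDARY` regular expressions, and the function names are all assumptions made for illustration. A pattern here is only the text between the two terms, and a simple support count (`min_support`) stands in for the paper's pattern confidence.

```python
import re
from collections import Counter

# Toy corpus standing in for encyclopedia text (invented for illustration).
CORPUS = [
    "番茄又称西红柿。",
    "土豆又称马铃薯。",
    "玉米又称苞谷。",
    "番茄俗称西红柿。",
    "玉米俗称苞谷。",
    "红薯俗称地瓜。",
    "土豆和番茄都是蔬菜。",
]

TERM = r"[\u4e00-\u9fff]{2,4}"  # crude stand-in for a real term recognizer
BOUNDARY = r"[,。;、]"         # punctuation treated as a term boundary


def induce_patterns(sentences, pairs, min_support=2, max_len=4):
    """Count the text found between the members of known pairs; keep frequent ones."""
    counts = Counter()
    for sentence in sentences:
        for left, right in pairs:
            start = sentence.find(left)
            if start < 0:
                continue
            end = sentence.find(right, start + len(left))
            if end < 0:
                continue
            middle = sentence[start + len(left):end]
            if 0 < len(middle) <= max_len:
                counts[middle] += 1
    return {m for m, c in counts.items() if c >= min_support}


def apply_patterns(sentences, patterns):
    """Match 'TERM middle TERM' between boundaries to propose new pairs."""
    pairs = set()
    for middle in patterns:
        regex = re.compile(
            rf"(?:^|{BOUNDARY})({TERM}){re.escape(middle)}({TERM})(?={BOUNDARY}|$)"
        )
        for sentence in sentences:
            for match in regex.finditer(sentence):
                pairs.add((match.group(1), match.group(2)))
    return pairs


def bootstrap(sentences, seeds, rounds=2):
    """Alternate between inducing patterns from pairs and pairs from patterns."""
    pairs = set(seeds)
    patterns = set()
    for _ in range(rounds):
        patterns |= induce_patterns(sentences, pairs)
        pairs |= apply_patterns(sentences, patterns)
    return patterns, pairs


if __name__ == "__main__":
    seeds = {("番茄", "西红柿"), ("土豆", "马铃薯")}
    patterns, pairs = bootstrap(CORPUS, seeds, rounds=2)
    print(sorted(patterns))
    print(sorted(pairs))
```

On this toy data, the first round learns the pattern 又称 from the two seeds and uses it to find the pair (玉米, 苞谷). That new pair gives 俗称 enough support in the second round, which in turn yields (红薯, 地瓜). A real system would need word segmentation and a stricter confidence measure to limit the noise that such loose patterns introduce.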
