Acta Scientiarum Naturalium Universitatis Pekinensis

Chinese Word Segmentation for Patent Documents

YUE Jinyuan, XU Jin’an, ZHANG Yujie   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044
  • Received: 2012-06-04 Online: 2013-01-20 Published: 2013-01-20

面向专利文献的汉语分词技术研究

岳金媛,徐金安,张玉洁   

  1. 北京交通大学计算机与信息技术学院, 北京 100044;

Abstract: Patent documents contain many technical terms and span a wide range of domains. To address these characteristics, the authors present a Chinese word segmentation approach that combines domain dictionaries with statistical methods. The NC-value algorithm is used to extract domain terms, and a conditional random fields (CRF) model is used to improve recognition of those terms, which addresses the unknown-word problem. Experimental results show that the proposed method improves both segmentation accuracy and the identification of unknown words. In an open test, the method achieves a precision of 95.56%, a recall of 96.18%, and an F-measure of 95.87%.

Key words: Chinese word segmentation, conditional random fields (CRF), domain terms extraction
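
The precision, recall, and F-measure reported in the abstract are the standard word-level metrics for segmentation. The abstract does not describe the authors' evaluation script, so the following is only a minimal sketch under the usual assumption that a predicted word counts as correct when its character span exactly matches a gold-standard word. The function names (`to_spans`, `segmentation_prf`, `f_measure`) and the example sentence are hypothetical and chosen for illustration.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F for one segmented sentence.

    Assumption: a predicted word is correct only if its span exactly
    matches the span of a gold-standard word.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical example: the segmenter wrongly splits a domain term in two.
gold = ["条件", "随机场", "模型"]
pred = ["条件", "随机", "场", "模型"]
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.4f} R={r:.4f} F={f:.4f}")

# Consistency check on the figures reported in the abstract (in percent).
reported_f = f_measure(95.56, 96.18)
print(f"F from reported P and R: {reported_f:.2f}")
```

In the hypothetical example, splitting the term 随机场 costs both precision and recall, which is the kind of error that domain term extraction aims to reduce. The final check shows that the reported F-measure is consistent with the reported precision and recall under the balanced F definition.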

摘要: 针对专利文献专业术语多、领域广的特点, 采用基于领域词典与统计相结合的方法探讨了专利文献的汉语分词问题。利用NC-value算法抽取专业术语, 使用条件随机场模型(CRF)提高专业术语识别率, 提高分词精度。实验结果表明, 提出的方法在开放测试下分词的准确率为95.56%, 召回率为96.18%, F值为95.87%, 大大提高了专利文献的分词精度。

关键词: 汉语分词, 条件随机场, 专业术语提取
