Acta Scientiarum Naturalium Universitatis Pekinensis

C-TERN: A Temporal Information Processing Algorithm of Chinese Military News Story Based on Cascade Finite State Automata

WANG Wei1,2, ZHAO Dongyan2, SU Tingting1   

  1. Key Laboratory on Information Security, Engineering University of CAPF, Xi’an 710086; 2. Institute of Computer Science and Technology, Peking University, Beijing 100080
  • Received: 2013-06-16 Online: 2014-01-20 Published: 2014-01-20

C-TERN: 一种基于CFSA的军事新闻文本时间信息处理算法

王伟1,2,赵东岩2,苏婷婷1   

  1. 武警工程大学信息安全重点实验室, 西安 710086; 2. 北京大学计算科学与技术研究所, 北京 100080

Abstract: The authors propose C-TERN, a method based on cascaded finite state automata (CFSA) for recognizing and normalizing temporal expressions in Chinese military news stories. In the recognition stage, C-TERN organizes the temporal expression rules extracted from general language and from military language into layers and recognizes temporal expressions layer by layer. In the normalization stage, C-TERN infers and normalizes simple and special time expressions, durations, and absolute and relative temporal expressions in four steps. The method pays particular attention to the correctness of the extracted rules, the resolution of conflicts between rules, and the soundness of the matching strategy. Experimental results on multiple datasets show that the proposed method effectively recognizes and normalizes absolute and relative temporal expressions as well as simple and special times and durations, and that it can better meet the temporal information processing needs of military applications.
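The layer-by-layer recognition described in the abstract can be illustrated with a small sketch. This is not the authors' rule set or implementation: the rule names (`DATE`, `CLOCK`, `REL_DAY`), the regular expressions, and the functions `layer1`, `layer2`, and `recognize` are all hypothetical. Regular expressions stand in for the finite state automata, and conflict resolution between overlapping rules is left out.

```python
import re

# Layer 1: basic temporal units. These patterns are illustrative assumptions,
# not the rules extracted in the paper.
LAYER1_RULES = [
    ("DATE", re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")),
    ("CLOCK", re.compile(r"\d{1,2}时(?:\d{1,2}分)?")),
    ("REL_DAY", re.compile(r"昨天|今天|明天")),
]


def layer1(text):
    """Return sorted (start, end, label) spans matched by the basic rules."""
    spans = []
    for label, pattern in LAYER1_RULES:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def layer2(spans):
    """Merge a day-level span with a directly following clock-time span."""
    merged = []
    for start, end, label in spans:
        if (
            merged
            and label == "CLOCK"
            and merged[-1][1] == start
            and merged[-1][2] in ("DATE", "REL_DAY")
        ):
            prev_start, _, prev_label = merged.pop()
            merged.append((prev_start, end, prev_label + "+CLOCK"))
        else:
            merged.append((start, end, label))
    return merged


def recognize(text):
    """Run the two layers in sequence and return (expression, label) pairs."""
    return [(text[s:e], label) for s, e, label in layer2(layer1(text))]


if __name__ == "__main__":
    print(recognize("部队于2013年6月16日8时30分出发，明天返回。"))
```

Each layer consumes the output of the previous one, so later layers can build composite expressions without re-scanning the raw text.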

Key words: natural language processing, finite state automata, temporal expression, recognition and normalization

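Abstract (Chinese version): This paper proposes C-TERN, an algorithm based on cascaded finite state automata (CFSA) for recognizing and normalizing temporal expressions in Chinese military text. C-TERN first uses a mature word segmentation tool to identify the time words in the text, then divides the temporal expression rules extracted from general language and military language into multiple layers and performs fine-grained recognition of temporal information layer by layer. In the normalization process, four steps perform inference and normalization on special temporal expressions, simple temporal expressions, duration expressions, and absolute/relative temporal expressions, respectively. The algorithm takes into account the correctness of rule-set extraction, the resolution of conflicts between rules, and the soundness of the matching strategy. Experimental results on multiple datasets show that C-TERN not only effectively recognizes standard, offset, and uncertain temporal expressions, but also infers and normalizes simple, special, and implicit time points, durations, and offset times, meeting the needs of temporal information processing for military text.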
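As a rough illustration of the normalization stage, the sketch below resolves absolute dates, dates with an omitted year, and relative day words against a reference date such as the publication date of a news story. The function `normalize`, the table `REL_OFFSETS`, and the two patterns are hypothetical names introduced here, not the paper's four-step procedure. Durations and uncertain expressions are not handled.

```python
import re
from datetime import date, timedelta

# Hypothetical offset table for relative day words (assumption, not from the paper).
REL_OFFSETS = {"前天": -2, "昨天": -1, "今天": 0, "明天": 1, "后天": 2}

ABS_DATE = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})日")
PARTIAL_DATE = re.compile(r"(\d{1,2})月(\d{1,2})日")


def normalize(expr, reference):
    """Map a recognized expression to an ISO 8601 date string, or None."""
    match = ABS_DATE.fullmatch(expr)
    if match:
        # Absolute expression: all fields are given explicitly.
        year, month, day = map(int, match.groups())
        return date(year, month, day).isoformat()

    match = PARTIAL_DATE.fullmatch(expr)
    if match:
        # Implicit year: assume the year of the reference date.
        month, day = map(int, match.groups())
        return date(reference.year, month, day).isoformat()

    if expr in REL_OFFSETS:
        # Relative expression: offset in days from the reference date.
        return (reference + timedelta(days=REL_OFFSETS[expr])).isoformat()

    return None  # Not covered by this sketch.


if __name__ == "__main__":
    ref = date(2014, 1, 20)
    for expr in ["2013年6月16日", "3月5日", "昨天"]:
        print(expr, normalize(expr, ref))
```

Filling the missing year from the reference date is a simplification; a fuller system would also need context to decide whether an implicit date refers to the past or the future.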

Key words (Chinese version): natural language understanding, finite state automata, temporal expression, recognition and normalization
