Acta Scientiarum Naturalium Universitatis Pekinensis


Identification of Topic Sentence about Key Event in Chinese News

WANG Wei1,3, ZHAO Dongyan1,2, ZHAO Wei1   

  1. Institute of Computer Science and Technology, Peking University, Beijing 100871; 2. Key Laboratory of Computational Linguistics (MOE), Peking University, Beijing 100871; 3. Department of Electronic Technology, Engineering College of Armed Police Force, Xi’an 710086
  • Received: 2010-09-10 Online: 2011-09-20 Published: 2011-09-20

中文新闻关键事件的主题句识别

王伟1,3,赵东岩1,2,赵伟1   

  1. 北京大学计算科学与技术研究所, 北京 100871; 2. 计算语言学教育部重点实验室, 北京 100871; 3. 武警工程学院电子技术系, 西安 710086

Abstract: The authors propose an approach for extracting the topic sentence that describes the key event of a news article. Taking the distinctive structure of news articles into account, they study the relation between a news article and the key event it reports, as well as the characteristics of a news headline in three aspects: information, form and language. A novel method based on the information aspect of the headline is used to extract from a news story the topic sentence that contains the key event information. The method first classifies the headline as informative or non-informative, and then uses textual and semantic features of each sentence, such as word frequency, sentence length, location in the text and word co-occurrence with the headline, to evaluate the importance of each sentence and select the most important one as the topic sentence. Experimental results show that the method identifies topic sentences accurately and that the approach lays a good foundation for subsequent event information extraction.
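The two-step procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the function names, the feature definitions, the weights and the `min_len` threshold below are all hypothetical. The sketch also assumes that sentences and the headline are already segmented into tokens, and that the informative/non-informative decision for the headline is supplied by the caller as a boolean.

```python
from collections import Counter


def score_sentences(sentences, headline, headline_informative,
                    weights=(0.3, 0.1, 0.2, 0.4), min_len=5):
    """Score each sentence of one article (hypothetical feature mix).

    sentences: list of token lists, in document order.
    headline: token list for the headline.
    headline_informative: result of a headline classifier (assumed given).
    weights: illustrative weights for (frequency, length, position, headline).
    """
    w_tf, w_len, w_pos, w_head = weights
    doc_counts = Counter(tok for sent in sentences for tok in sent)
    max_count = max(doc_counts.values(), default=1)
    headline_set = set(headline)
    n = len(sentences)

    scores = []
    for i, sent in enumerate(sentences):
        if not sent:
            scores.append(0.0)
            continue
        # Word frequency: mean document frequency of the sentence's tokens.
        tf = sum(doc_counts[t] for t in sent) / (len(sent) * max_count)
        # Length: very short sentences are penalized, capped at 1.0.
        length = min(len(sent) / min_len, 1.0)
        # Position: earlier sentences score higher.
        position = 1.0 - i / n
        # Headline co-occurrence: only trusted for informative headlines.
        if headline_informative and headline_set:
            overlap = len(headline_set & set(sent)) / len(headline_set)
        else:
            overlap = 0.0
        scores.append(w_tf * tf + w_len * length
                      + w_pos * position + w_head * overlap)
    return scores


def pick_topic_sentence(sentences, headline, headline_informative):
    """Return the index of the highest-scoring sentence, or None if empty."""
    scores = score_sentences(sentences, headline, headline_informative)
    if not scores:
        return None
    return max(range(len(scores)), key=scores.__getitem__)
```

With an informative headline the overlap term can pull a later sentence ahead of the lead sentence; with a non-informative headline the choice falls back on frequency, length and position alone.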

Key words: computer application, Chinese information processing, natural language processing, automatic summarization, event extraction, news headline

摘要: 提出在单文档中通过提取主题句以获取关键事件信息的思想。根据新闻的体裁特点, 分析了新闻报道与事件的关系, 以及新闻标题在内容、形式和语言方面的特征。提出利用标题的提示性信息提取主题句来描述新闻关键事件的方法。该方法首先对新闻标题按信息含量进行分类, 然后结合新闻句子的词频、长度、位置、与标题的相似度等特征计算句子的重要性。实验表明, 该方法能够准确提取新闻主题句, 为进一步抽取事件信息打好了基础。

关键词: 计算机应用, 中文信息处理, 自然语言处理, 自动文摘, 事件抽取, 新闻标题
