北京大学学报(自然科学版)

Multi-feature Automatic Summarization Based on the Hidden Topic Markov Model

刘江鸣,徐金安,张玉洁   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044;
  • Received: 2013-06-21 Online: 2014-01-20 Published: 2014-01-20

Summarization Based on Hidden Topic Markov Model with Multi-features

LIU Jiangming, XU Jin’an, ZHANG Yujie   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044;
  • Received: 2013-06-21 Online: 2014-01-20 Published: 2014-01-20

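Abstract: Building on the hidden topic Markov model, this work removes the topic independence assumption of the LDA topic model so that summary generation makes full use of a document's structural information, and combines this with content-based multi-feature methods to improve summary quality. It proposes a strategy for extending automatic summarization from a single document to multiple documents without breaking document structure, and builds a complete automatic summarization system on that basis. Experiments on the DUC2007 standard dataset demonstrate the advantage of the hidden topic Markov model and of the document features, and the resulting system shows a clear improvement in ROUGE scores.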

Key words: multi-document automatic summarization, hidden topic Markov model, multi-features

Abstract: Based on the hidden topic Markov model (HTMM), the authors remove the topic independence assumption of LDA (latent Dirichlet allocation) so that document structure information can be exploited when generating a summary, and use multiple content-based document features to improve summary quality. Furthermore, a method is proposed for extending single-document summarization to multi-document summarization without breaking document structure, yielding a complete automatic summarization system. Experimental results on the standard DUC2007 dataset show the advantage of HTMM and of the multiple features: compared with the performance of LDA, ROUGE values improve when HTMM is combined with multi-features.

Key words: multi-features, multi-document summarization, hidden topic Markov model
