Acta Scientiarum Naturalium Universitatis Pekinensis (北京大学学报(自然科学版))

Computer-Aided In-Depth Research on Tang and Song Poetry (唐宋诗之计算机辅助深层研究)

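HU Junfeng (胡俊峰), YU Shiwen (俞士汶)

  1. Department of Computer Science and Technology, Peking University; Institute of Computational Linguistics, Peking University, Beijing 100871
  • Received: 2000-08-15 Published: 2001-09-20 Released: 2001-09-20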

The Computer Aided Research Work of Chinese Ancient Poems

HU Junfeng, YU Shiwen

  1. Institute of Computational Linguistics, Peking University; Department of Computer Science and Technology, Peking University, Beijing 100871
  • Received: 2000-08-15 Online: 2001-09-20 Published: 2001-09-20

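Abstract (translated from the Chinese): This paper introduces the "Computer-Aided Research System for Tang and Song Poetry" developed by the Institute of Computational Linguistics at Peking University. The system is built on a corpus consisting of the Quan Tang Shi (全唐诗, Complete Tang Poems; 4.81 million characters) and poems by selected major Song-dynasty poets (1.60 million characters). Computational-linguistic methods were applied to analyze the poems and to extract their vocabulary, yielding more than 50,000 entries. On top of a word segmentation of the poems, the system records word co-occurrence relations, parallelism (对仗) relations between words, and the distribution of each word across authors. Beyond full-text retrieval over poem content, the system offers word-based statistical analysis and similarity retrieval over lines of verse, and it provides automatic phonetic annotation of the Quan Tang Shi.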

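Keywords (translated from the Chinese): corpus linguistics, unlisted-word discovery, automatic phonetic annotation, computer-aided research on Tang and Song poetry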
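The word co-occurrence and author-distribution information mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the system's implementation: the function names `cooccurrence` and `author_distribution` are hypothetical, the four sample lines are well-known verses segmented by hand for illustration, and "co-occurrence" is assumed here to mean "appearing in the same line".

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy input: (author, hand-segmented line). The segmentation is illustrative
# only and is not taken from the system described in the abstract.
LINES = [
    ("李白", ["床前", "明月", "光"]),
    ("李白", ["疑是", "地上", "霜"]),
    ("王维", ["明月", "松间", "照"]),
    ("杜甫", ["国破", "山河", "在"]),
]


def cooccurrence(lines):
    """Count unordered pairs of distinct words that share a line (assumed definition)."""
    pairs = Counter()
    for _author, words in lines:
        # sorted(set(...)) makes each pair canonical, so (a, b) and (b, a) are merged
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)] += 1
    return pairs


def author_distribution(lines):
    """Map each word to a Counter of how many times each author uses it."""
    dist = defaultdict(Counter)
    for author, words in lines:
        for word in words:
            dist[word][author] += 1
    return dist


pairs = cooccurrence(LINES)
dist = author_distribution(LINES)
print(dict(dist["明月"]))
```

Storing pairs under a sorted key keeps the table symmetric, which is usually what a co-occurrence lookup wants; a parallelism (对仗) table would instead pair words at matching positions in the two lines of a couplet.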

Abstract: Based on 6.4 million characters of ancient Chinese poetry, the “Computer aided research system of Chinese ancient poems” provides a word-based platform for analyzing ancient Chinese poems. More than 50,000 Chinese words, including 40,814 multi-character words, were extracted from the corpus by statistical methods. Besides full-text retrieval, the system also provides word-based statistical analysis, sentence-based similarity retrieval, automatic Pinyin tagging, and other functions that support in-depth analysis of ancient Chinese poems. The project was funded by the National Social Science Foundation of China (1998-1999).

Key words: corpus linguistics, unlisted word discovery, automatic pinyin tagging, computer-aided analysis of Chinese ancient poems
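The abstract says only that multi-character words were extracted by a statistical method and does not name the statistic. As one common possibility, the sketch below scores adjacent character pairs by pointwise mutual information (PMI), so that pairs which occur together more often than chance become word candidates. PMI, the function name `bigram_pmi`, the `min_count` threshold, and the three sample lines are all assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(lines, min_count=2):
    """Score each adjacent character pair by PMI (an assumed statistic).

    lines: iterable of unsegmented strings, one line of verse each.
    Returns {bigram: PMI in bits} for bigrams seen at least min_count times.
    """
    chars = Counter()
    bigrams = Counter()
    for line in lines:
        chars.update(line)
        bigrams.update(line[i:i + 2] for i in range(len(line) - 1))

    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for bigram, count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_xy = count / n_bigrams
        p_x = chars[bigram[0]] / n_chars
        p_y = chars[bigram[1]] / n_chars
        scores[bigram] = math.log2(p_xy / (p_x * p_y))
    return scores


# Three well-known lines that all contain the pair 明月.
SAMPLE = ["床前明月光", "明月松间照", "举头望明月"]
scores = bigram_pmi(SAMPLE)
print(scores)
```

On this toy input only 明月 passes the frequency threshold, and its PMI is positive. On a real corpus the candidates would still need filtering against a known lexicon before being treated as newly discovered words.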

Chinese Library Classification (CLC) number: