北京大学学报(自然科学版) (Acta Scientiarum Naturalium Universitatis Pekinensis)

中文电子文档的数学公式定位研究

林晓燕,高良才,汤帜   

  1. 北京大学计算机科学技术研究所, 北京 100080;
  • 收稿日期:2013-06-21 出版日期:2014-01-20 发布日期:2014-01-20

Research on Mathematical Formula Identification in Digital Chinese Documents

LIN Xiaoyan, GAO Liangcai, TANG Zhi   

  1. Institute of Computer Science and Technology, Peking University, Beijing 100080
  • Received: 2013-06-21; Online: 2014-01-20; Published: 2014-01-20

摘要: 区别于传统基于图像和西文文档的公式定位方法, 针对中文电子文档的特点, 提出一种基于机器学习和规则相结合的独立公式和内嵌公式的定位方法。设计了适合中文文档的页面分行策略和词块划分规则; 选择适合中文文档的公式特征和机器学习算法; 针对公式定位中的过分割问题, 提出行合并与词块合并等后处理手段。实验结果表明, 该方法可以有效地从中文电子文档中自动定位公式区域。此外, 构建了公开可用的中文数据集, 以促进不同数学公式定位方法间的相互比较及性能评估。

关键词: 数学公式识别, 数学公式定位, 电子文档, 中文文档

Abstract: Unlike traditional formula identification methods, which target scanned images and Latin-script documents, the proposed method is designed around the characteristics of digital Chinese documents and combines machine learning with heuristic rules to identify both isolated and embedded formulae. Text line detection strategies and word segmentation rules suited to Chinese documents are designed; features and machine learning algorithms suited to formula identification in Chinese documents are selected; and post-processing steps, including text line merging and word merging, are proposed to address over-segmentation. Experimental results show that the method can effectively and automatically locate formula regions in digital Chinese documents. In addition, a publicly available Chinese document dataset is constructed to facilitate comparison and performance evaluation of different formula identification methods.

Key words: mathematical formula recognition, mathematical formula identification, digital documents, Chinese documents
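
The abstract does not spell out the post-processing rules, so the following is only a minimal sketch of what a text line merging step for over-segmented isolated formulae might look like. The `Line` type, the `is_formula` label (standing in for the output of some classifier), and the `max_gap` threshold are all illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical text line: vertical extent plus a classifier label."""
    top: float
    bottom: float
    is_formula: bool  # assumed to come from an upstream classifier


def merge_formula_lines(lines, max_gap=2.0):
    """Merge vertically adjacent lines that are both labeled as formula.

    Two consecutive formula lines are merged when the vertical gap between
    them is at most max_gap (an assumed threshold, in page units).
    """
    merged = []
    for line in sorted(lines, key=lambda l: l.top):
        prev = merged[-1] if merged else None
        if (prev is not None and prev.is_formula and line.is_formula
                and line.top - prev.bottom <= max_gap):
            # Extend the previous formula region to cover this line.
            merged[-1] = Line(prev.top, max(prev.bottom, line.bottom), True)
        else:
            merged.append(line)
    return merged


lines = [
    Line(0, 10, False),   # ordinary text
    Line(12, 20, True),   # upper part of a split formula
    Line(21, 30, True),   # lower part of the same formula
    Line(40, 50, False),  # ordinary text
]
result = merge_formula_lines(lines)
```

In this sketch the two formula fragments collapse into a single region while the surrounding text lines are left alone. A word-level merge for embedded formulae could follow the same pattern along the horizontal axis.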
