Acta Scientiarum Naturalium Universitatis Pekinensis


Automatic Table Boundary Detection and Performance Evaluation in Fixed-Layout Documents

FANG Jing1, GAO Liangcai1, QIU Ruiheng1,2,3, TANG Zhi1   

  1. Institute of Computer Science & Technology, Peking University, Beijing 100080; 2. State Key Laboratory of Digital Publishing Technology, Beijing 100080; 3. Founder Group Substation, Postdoctoral Workstation of the Zhongguancun Haidian Science Park, Beijing 100080
  • Received: 2012-06-03 Online: 2013-01-20 Published: 2013-01-20

版式电子文档表格自动检测与性能评估

房婧1,高良才1,仇睿恒1,2,3,汤帜1   

  1. 北京大学计算机科学技术研究所, 北京 100080; 2. 数字出版技术国家重点实验室, 北京 100080; 3. 中关村科技园区海淀园企业博士后科研工作站北大方正集团公司分站, 北京 100080

Abstract: The authors propose a novel table boundary detection method that combines visual separators with geometric information about content layout, and that is effective for both Chinese and English documents. In addition, because no automatic evaluation system exists for table boundary detection, the authors provide a publicly available large-scale dataset, composed of equal numbers of Chinese and English pages with annotated ground truth, and propose performance measures oriented toward mobile reading. Evaluation and comparison with two other open-source table boundary detection projects demonstrate the effectiveness of the proposed method and the practicality of the evaluation suite.

Key words: fixed-layout document, table location, table detection, automatic performance evaluation
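
The abstract does not spell out the proposed performance measures. Purely as an illustration of how a detected table boundary might be scored against an annotated ground-truth boundary, the following minimal sketch computes area-based precision and recall for two axis-aligned rectangles. The `Box` type, the function names, and the choice of area overlap as the criterion are all assumptions made for this example; they are not the authors' measures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as zero area.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes; zero when they do not overlap."""
    width = min(a.x1, b.x1) - max(a.x0, b.x0)
    height = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, width) * max(0.0, height)


def area_precision_recall(detected: Box, truth: Box) -> tuple[float, float]:
    """Assumed measure: the share of the detected area that is correct
    (precision) and the share of the true table area that is covered (recall)."""
    inter = intersection_area(detected, truth)
    precision = inter / detected.area() if detected.area() > 0 else 0.0
    recall = inter / truth.area() if truth.area() > 0 else 0.0
    return precision, recall
```

Under this assumed measure, a detection that covers the whole table plus surrounding text scores full recall but reduced precision, while a detection that clips part of the table scores full precision but reduced recall.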

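Abstract (translated from the Chinese): Based on the characteristics of fixed-layout electronic documents, a table location method is proposed that combines table ruling-line separators with the layout features of table text, and it is effective for both Chinese and English documents. In addition, to address the lack of an automatic evaluation system for table location, a publicly available dataset of initial scale is built, composed of Chinese and English fixed-layout pages in equal proportion and annotated with ground-truth results, and a set of evaluation criteria is proposed for mobile reading scenarios. Comparison with two existing open-source table location projects verifies the effectiveness of the proposed method and the practicality of the evaluation system, with particularly good results on the Chinese dataset.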

关键词: 版式文档, 表格定位, 表格检测, 自动性能评估
