Acta Scientiarum Naturalium Universitatis Pekinensis (北京大学学报(自然科学版))


A Study on Classification of Forms with Similar Layout

WANG Simeng, GAO Liangcai, WANG Yuehan, LI Pingli, TANG Zhi

  1. Institute of Computer Science and Technology, Peking University, Beijing 100080
  • Received: 2014-06-28 Online: 2015-03-20 Published: 2015-03-20

Abstract: For Chinese forms with similar layouts, the authors propose a simple but effective distance-based classification method. The method computes separate distance measures over three kinds of information in a form: user filled-in data, preprinted layout data, and positional offset (dithering). Three corresponding weight components reduce the influence on classification of, respectively, the randomness of user filled-in data, the layout consistency among similar forms, and position dithering. Experimental results show that the proposed method achieves more than 90% classification accuracy on several Chinese form image data sets, which is markedly higher than the accuracy of the state-of-the-art form classification method.

Key words: form classification, distance metric, weight calculation
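
The abstract describes the method only at a high level. As a minimal sketch, assuming each form is reduced to three numeric feature vectors and that the three distances are combined as a weighted sum, nearest-template classification could look like the following. The feature representation, the Euclidean distance, the weight values, and all names (`FormFeatures`, `weighted_distance`, `classify`) are illustrative assumptions, not the authors' formulation.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class FormFeatures:
    """Hypothetical per-form features; the source does not specify them."""
    filled: Sequence[float]  # user filled-in information
    layout: Sequence[float]  # preprinted layout information
    offset: Sequence[float]  # positional offset (dithering)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain Euclidean distance (an assumed choice of metric)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def weighted_distance(
    form: FormFeatures,
    template: FormFeatures,
    weights: Tuple[float, float, float] = (0.2, 0.6, 0.2),  # assumed values
) -> float:
    """Combine the three component distances as a weighted sum.

    A small weight on the filled-in component damps the effect of
    arbitrary user input; the other two weights scale the layout and
    offset components.
    """
    w_filled, w_layout, w_offset = weights
    return (
        w_filled * euclidean(form.filled, template.filled)
        + w_layout * euclidean(form.layout, template.layout)
        + w_offset * euclidean(form.offset, template.offset)
    )


def classify(
    form: FormFeatures,
    templates: Dict[str, FormFeatures],
    weights: Tuple[float, float, float] = (0.2, 0.6, 0.2),
) -> str:
    """Return the name of the template with the smallest weighted distance."""
    if not templates:
        raise ValueError("at least one template is required")
    return min(
        templates,
        key=lambda name: weighted_distance(form, templates[name], weights),
    )
```

Calling `classify(form, templates)` with a dictionary that maps class names to template features returns the name of the closest template. In practice the weights would have to be chosen or learned per data set; the fixed tuple here is only a placeholder.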

CLC number: