Acta Scientiarum Naturalium Universitatis Pekinensis, 2020, Vol. 56, Issue (5): 785-795. DOI: 10.13209/j.0479-8023.2020.019


Research on Cleaning and Repairing Methods of Civil Building Data on Resources Saving and Environment Protection

SHEN Hongyi1, XU Fangfang2, WANG Xinmin3,†   

  1. Center for Data Science, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871
  2. College of Mathematics and Systems Science, Shandong University of Science and Technology, Qingdao 266590
  3. School of Mathematical Sciences, Peking University, Beijing 100871
  • Received: 2019-08-13 Revised: 2019-11-07 Online: 2020-09-20 Published: 2020-09-20
  • Contact: WANG Xinmin, E-mail: wangxinmin(at)


To address the data quality problems in raw civil building data on resource saving and environmental protection, several methods are applied for data cleaning and data repair. For data cleaning, the authors focus on approximately duplicate records and abnormal records in the energy consumption data of a single building. Abnormal records are identified with three methods: the empirical (3σ) rule, the DBSCAN clustering algorithm, and the inner fence of the boxplot. For data repair, the authors focus on filling missing values and on model-based data correction. Missing values are filled in three ways: simple filling based on existing values in the dataset, the predicted values of a linear regression model, and the output of a user-based collaborative filtering recommendation algorithm. The filling results are compared using the mean absolute error as the evaluation index. To repair the building energy consumption data from Shanghai, multiple linear regression, principal component regression, partial least squares regression, ridge regression, and Lasso regression are used to fit the relationship between building energy consumption and the explanatory variables. The results show that, for the energy consumption data of a single building, the inner fence of the boxplot is suitable for identifying abnormal records and the median is suitable for filling missing values. For the building energy consumption data from Shanghai, the ridge regression model fits best.

Key words: resources saving and environment protection, data cleaning, data repairing, DBSCAN clustering algorithm, user-based collaborative filtering, ridge regression
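
The single-building workflow named in the abstract (inner fence of the boxplot for abnormal records, median filling for missing values, mean absolute error for comparison) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the readings are invented, the fence multiplier k = 1.5 is the conventional boxplot choice and is assumed here, and all function names are hypothetical.

```python
import numpy as np


def inner_fence_mask(values, k=1.5):
    """Flag readings outside the boxplot inner fence [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(x, [25, 75])  # quartiles, ignoring missing values
    iqr = q3 - q1
    # Comparisons with NaN evaluate to False, so missing readings are not flagged.
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


def three_sigma_mask(values):
    """Flag readings more than three standard deviations from the mean."""
    x = np.asarray(values, dtype=float)
    return np.abs(x - np.nanmean(x)) > 3 * np.nanstd(x)


def median_fill(values):
    """Replace missing readings (NaN) with the median of the observed ones."""
    x = np.asarray(values, dtype=float).copy()
    x[np.isnan(x)] = np.nanmedian(x)
    return x


def mean_absolute_error(truth, estimate):
    """Average of |truth - estimate| over all positions."""
    truth = np.asarray(truth, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    return float(np.mean(np.abs(truth - estimate)))


# Hypothetical energy readings for one building: one gap, one suspicious spike.
readings = np.array([100, 102, 98, 101, np.nan, 99, 500, 103, 97, 100])

# Step 1: identify abnormal records with the inner fence and treat them as missing.
flagged = inner_fence_mask(readings)
cleaned = np.where(flagged, np.nan, readings)

# Step 2: fill every missing value with the median of the remaining readings.
repaired = median_fill(cleaned)

# Step 3: score the filling rule by hiding known values and measuring the error.
observed = cleaned[~np.isnan(cleaned)]
held_out = observed[:3]
estimate = np.full(held_out.shape, np.median(observed[3:]))
mae = mean_absolute_error(held_out, estimate)
```

On this toy series the inner fence flags the 500 reading, while `three_sigma_mask` does not, because the spike itself inflates the standard deviation of such a short series. This only illustrates how the two rules can differ; it is not the evidence behind the paper's conclusion.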

