Acta Scientiarum Naturalium Universitatis Pekinensis ›› 2021, Vol. 57 ›› Issue (5): 804-814. DOI: 10.13209/j.0479-8023.2021.020


Estimation of Area of Completed Houses Based on Statistical Yearbooks and Online Big Data

YUAN Wen1, WANG Jun1, SHEN Hongyi1, WANG Xinmin2,†   

  1. Academy for Advanced Interdisciplinary Studies, Peking University, Beijing 100871
  2. School of Mathematical Sciences, Peking University, Beijing 100871
  • Received: 2020-06-29 Revised: 2021-01-18 Online: 2021-09-20 Published: 2021-09-20
  • Contact: WANG Xinmin, E-mail: wangxinmin(at)




The authors select several indicators from the Beijing Yearbook to construct a system of economic and social factors, and use three models (partial least squares regression, LASSO regression, and an RBF neural network) to predict the area of completed houses in Beijing in 2017 and 2018. However, because the yearbook indicators differ in statistical channels and granularity, and the release of some construction-industry indicators for 2019 was delayed, it is difficult to estimate the completed area for 2019 by model fitting. Web crawler technology is therefore used to obtain high-quality data and to mine online big data for the information needed to estimate the completed area. First, a web-based framework for acquiring building data is established, which crawls attribute data for eight types of buildings in Beijing by calling service interfaces, running keyword searches, and using other techniques. Second, regular expressions and conditional filtering are used to extract and clean the unstructured HTML data returned by web pages. Finally, the area of completed houses in Beijing in 2019, and the completed area in each functional zone, are estimated.
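The extraction and cleaning step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the HTML layout, the field names (`name`, `type`, `year`, `area`), the sample records, and the helper functions `extract_records` and `completed_area_by_type` are all hypothetical, chosen only to show how regular expressions and conditional filtering might turn crawled HTML into per-category completed-area totals.

```python
import re

# Hypothetical HTML fragment standing in for a crawled page.
# The tag layout and every value below are illustrative assumptions.
SAMPLE_HTML = """
<div class="building"><span class="name">Tower A</span><span class="type">office</span><span class="year">2019</span><span class="area">12,500.5 m2</span></div>
<div class="building"><span class="name">Block B</span><span class="type">residential</span><span class="year">2018</span><span class="area">8,000 m2</span></div>
<div class="building"><span class="name">Mall C</span><span class="type">commercial</span><span class="year">2019</span><span class="area">N/A</span></div>
<div class="building"><span class="name">Court D</span><span class="type">residential</span><span class="year">2019</span><span class="area">30,000 m2</span></div>
"""

# Template-style pattern: one match per building record.
RECORD_RE = re.compile(
    r'<span class="name">(?P<name>[^<]*)</span>'
    r'<span class="type">(?P<type>[^<]*)</span>'
    r'<span class="year">(?P<year>\d{4})</span>'
    r'<span class="area">(?P<area>[^<]*)</span>'
)

# Accepts values such as "12,500.5 m2"; anything else is treated as dirty data.
AREA_RE = re.compile(r'^\s*([\d,]+(?:\.\d+)?)\s*m2\s*$')


def extract_records(html):
    """Extract building records and drop those whose area cannot be parsed."""
    records = []
    for match in RECORD_RE.finditer(html):
        area_match = AREA_RE.match(match.group("area"))
        if area_match is None:
            continue  # cleaning: skip records with a missing or malformed area
        records.append({
            "name": match.group("name").strip(),
            "type": match.group("type").strip(),
            "year": int(match.group("year")),
            "area_m2": float(area_match.group(1).replace(",", "")),
        })
    return records


def completed_area_by_type(records, year):
    """Conditional filter on completion year, then sum area per building type."""
    totals = {}
    for record in records:
        if record["year"] != year:
            continue
        totals[record["type"]] = totals.get(record["type"], 0.0) + record["area_m2"]
    return totals


records = extract_records(SAMPLE_HTML)
totals_2019 = completed_area_by_type(records, 2019)
print(totals_2019)
```

In this sketch the record with an unparseable area is discarded during cleaning, and the year filter keeps only buildings completed in the target year before the totals are summed by building type.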

Key words: area of completed houses, regression analysis, web crawler, template extraction

