Acta Scientiarum Naturalium Universitatis Pekinensis, 2022, Vol. 58, Issue (6): 1035-1041. DOI: 10.13209/j.0479-8023.2022.089

Design and Implementation of Object Detection Acceleration Module Based on an ARM+FPGA Heterogeneous Platform

LI Fang1, CAO Jian1,†, LI Pu1, XIE Hao1, ZHAO Xiongbo2, WANG Yuan3,†, ZHANG Xing1,†   

  1. School of Software & Microelectronics, Peking University, Beijing 102600; 2. Beijing Aerospace Automatic Control Institute, Beijing 100070; 3. School of Integrated Circuits, Peking University, Beijing 100871
  • Received: 2021-12-20; Revised: 2022-05-18; Online: 2022-11-20; Published: 2022-11-20
  • Contact: CAO Jian, E-mail: caojian(at)ss.pku.edu.cn; WANG Yuan, E-mail: wangyuan(at)pku.edu.cn; ZHANG Xing, E-mail: zhx(at)pku.edu.cn

基于ARM+FPGA异构平台的目标检测加速模块设计与实现

李放1, 曹健1,†, 李普1, 谢豪1, 赵雄波2, 王源3,†, 张兴1,†   

  1. 北京大学软件与微电子学院, 北京 102600; 2. 北京航天自动控制研究所, 北京 100070; 3. 北京大学集成电路学院, 北京 100871
  • 通讯作者: 曹健, E-mail: caojian(at)ss.pku.edu.cn; 王源, E-mail: wangyuan(at)pku.edu.cn; 张兴, E-mail: zhx(at)pku.edu.cn
  • Funding:
    Supported by the National Key Research and Development Program of China (2018YFE0203801)

Abstract:

Deep-learning-based object detection algorithms rely on large models that are difficult to deploy on edge devices. Taking the YOLO (you only look once) object detection algorithm as an example, an acceleration module based on an ARM+FPGA heterogeneous platform is proposed. The FPGA accelerates forward inference of the compressed (pruned and quantized) model, while the ARM processor is responsible for process scheduling. Experimental results show that the peak performance of the system reaches 425.8 GOP/s at a working frequency of 200 MHz. Deployed on a Xilinx ZCU102 board, the system achieves a frame rate of 30.3 fps with a power consumption of 3.56 W. The module is also configurable.
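The reported figures can be combined into a few back-of-the-envelope metrics. The sketch below uses only the four numbers given in the abstract; the function name `derived_metrics` and the dictionary keys are illustrative, and the results assume that power draw is constant during inference and that frames are processed one at a time, neither of which the abstract states.

```python
# Figures reported in the abstract.
THROUGHPUT_GOPS = 425.8  # GOP/s, reported throughput at 200 MHz
CLOCK_MHZ = 200.0        # FPGA working frequency
FRAME_RATE_FPS = 30.3    # frame rate on the Xilinx ZCU102 board
POWER_W = 3.56           # reported power consumption


def derived_metrics(gops, clock_mhz, fps, power_w):
    """Derive rough efficiency figures from the reported numbers.

    Assumptions (not stated in the abstract): constant power during
    inference, and no overlap between consecutive frames.
    """
    return {
        # Operations completed per FPGA clock cycle
        "ops_per_cycle": gops * 1e9 / (clock_mhz * 1e6),
        # Throughput per watt
        "gops_per_watt": gops / power_w,
        # Time spent per frame, in milliseconds
        "latency_ms": 1000.0 / fps,
        # Energy spent per frame, in joules
        "energy_per_frame_j": power_w / fps,
    }


metrics = derived_metrics(THROUGHPUT_GOPS, CLOCK_MHZ, FRAME_RATE_FPS, POWER_W)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the numbers correspond to about 2129 operations per clock cycle, roughly 120 GOP/s per watt, about 33 ms per frame, and about 0.12 J per frame.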

Key words: deep learning, object detection, model pruning and quantization, heterogeneous platform, edge computing

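Abstract (translated from the Chinese version):

To address the problem that deep-learning-based object detection models are large and difficult to deploy on edge devices, an object detection acceleration module based on an ARM+FPGA heterogeneous platform is designed and implemented, taking the YOLO object detection model as an example. The system uses a compressed model obtained by pruning and quantization; the FPGA accelerates neural network forward inference, and the ARM processor schedules the accelerator. Experimental results show that, deployed on a Xilinx ZCU102 development board at a working frequency of 200 MHz, the module reaches an average computing performance of 425.8 GOP/s, runs inference on the compressed model at 30.3 fps, and consumes 3.56 W. The results also demonstrate that the acceleration module is configurable.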
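The abstract names pruning and quantization as the compression steps but does not say which pruning criterion or bit width is used. As a minimal sketch of the general idea only, the example below assumes magnitude-based pruning and symmetric per-tensor 8-bit quantization; both choices, the function names, and the sample weights are hypothetical and are not taken from the paper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights.

    Magnitude pruning is an assumed criterion, used here for illustration.
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Indices ordered by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_idx = set(order[:n_prune])
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]


def quantize_int8(weights):
    """Symmetric per-tensor quantization to the range [-127, 127].

    The 8-bit width is an assumption; the abstract does not state it.
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


# Made-up weights for demonstration.
weights = [0.82, -0.05, 0.31, -1.27, 0.02, 0.64, -0.48, 0.09]
pruned = magnitude_prune(weights, sparsity=0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
print(q)
```

Pruning produces zeros that hardware can skip or store compactly, and integer weights let the multiply-accumulate units work on narrow fixed-point values, which is the usual motivation for applying both steps before FPGA deployment.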

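Key words (translated from the Chinese version): deep learning, object detection, model pruning and quantization, heterogeneous platform, edge computing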