TY - GEN
T1 - CNN-MERP
T2 - 34th IEEE International Conference on Computer Design, ICCD 2016
AU - Han, Xushen
AU - Zhou, Dajiang
AU - Wang, Shihao
AU - Kimura, Shinji
N1 - Publisher Copyright: © 2016 IEEE. Copyright 2017 Elsevier B.V., All rights reserved.
PY - 2016/11/22
Y1 - 2016/11/22
N2 - Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. Because CNNs involve very high computational complexity, VLSI (ASIC and FPGA) chips, which deliver high-density integration of computational resources, are regarded as a promising platform for CNN implementation. With massively parallel computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as lacking the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces bandwidth requirements through multiple optimizations, including on/off-chip data allocation, data flow optimization and data reuse. The proposed 2-level reconfigurability, which is based on the control logic and the multiboot feature of the FPGA, enables fast and efficient reconfiguration. As a result, an external memory bandwidth requirement of 1.94 MB/GFlop is achieved, which is 55% lower than prior art. Under limited DRAM bandwidth, a system throughput of 1244 GFlop/s is achieved on the Virtex UltraScale platform, which is 5.48 times higher than state-of-the-art FPGA implementations.
AB - Large-scale deep convolutional neural networks (CNNs) are widely used in machine learning applications. Because CNNs involve very high computational complexity, VLSI (ASIC and FPGA) chips, which deliver high-density integration of computational resources, are regarded as a promising platform for CNN implementation. With massively parallel computational units, however, the external memory bandwidth, which is constrained by the pin count of the VLSI chip, becomes the system bottleneck. Moreover, VLSI solutions are usually regarded as lacking the flexibility to be reconfigured for the various parameters of CNNs. This paper presents CNN-MERP to address these issues. CNN-MERP incorporates an efficient memory hierarchy that significantly reduces bandwidth requirements through multiple optimizations, including on/off-chip data allocation, data flow optimization and data reuse. The proposed 2-level reconfigurability, which is based on the control logic and the multiboot feature of the FPGA, enables fast and efficient reconfiguration. As a result, an external memory bandwidth requirement of 1.94 MB/GFlop is achieved, which is 55% lower than prior art. Under limited DRAM bandwidth, a system throughput of 1244 GFlop/s is achieved on the Virtex UltraScale platform, which is 5.48 times higher than state-of-the-art FPGA implementations.
KW - FPGA
KW - backward propagation
KW - convolutional neural networks
KW - memory bandwidth
KW - reconfigurable processor
UR - http://www.scopus.com/inward/record.url?scp=85006705647&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85006705647&partnerID=8YFLogxK
U2 - 10.1109/ICCD.2016.7753296
DO - 10.1109/ICCD.2016.7753296
M3 - Conference contribution
AN - SCOPUS:85006705647
T3 - Proceedings of the 34th IEEE International Conference on Computer Design, ICCD 2016
SP - 320
EP - 327
BT - Proceedings of the 34th IEEE International Conference on Computer Design, ICCD 2016
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 2 October 2016 through 5 October 2016
ER -