Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China..
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China..
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China..
Nanjing Univ, Sch Elect Sci & Engn, Nanjing 210093, Jiangsu, Peoples R China..
Show others and affiliations
2019 (English)In: ELECTRONICS, ISSN 2079-9292, Vol. 8, no 4, article id 371Article in journal (Refereed) Published
Abstract [en]

Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on the embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs are proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which can reduce the power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with the low power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to well support the streaming architecture. The accelerator is implemented with TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1mW, which makes it a promising design for the embedded devices.

Place, publisher, year, edition, pages
MDPI , 2019. Vol. 8, no 4, article id 371
Keywords [en]
low bit-width convolutional neural networks, parallel streaming architecture, coarse grain task partitioning, reconfigurable, VLSI
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-252646DOI: 10.3390/electronics8040371ISI: 000467751100002Scopus ID: 2-s2.0-85064599789OAI: oai:DiVA.org:kth-252646DiVA, id: diva2:1321856
Note

QC 20190610

Available from: 2019-06-10 Created: 2019-06-10 Last updated: 2019-06-10Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Authority records BETA

Lu, Zhonghai

Search in DiVA

By author/editor
Lu, Zhonghai
By organisation
Electronics
Electrical Engineering, Electronic Engineering, Information Engineering

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 80 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf