Publications (10 of 193)
Qin, Z., Zhu, D., Zhu, X., Chen, X., Shi, Y., Gao, Y., . . . Pan, H. (2019). Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights. Electronics, 8(1), Article ID 78.
Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights
2019 (English). In: Electronics, ISSN 2079-9292, Vol. 8, no. 1, article id 78. Article in journal (Refereed), Published
Abstract [en]

As a key ingredient of deep neural networks (DNNs), fully-connected (FC) layers are widely used in various artificial intelligence applications. However, FC layers contain a large number of parameters, so efficient processing of FC layers is restricted by memory bandwidth. In this paper, we propose a compression approach combining block-circulant matrix-based weight representation and power-of-two quantization. Applying block-circulant matrices in FC layers reduces the storage complexity from O(k²) to O(k). By quantizing the weights into integer powers of two, the multiplications in inference can be replaced by shift and add operations. The memory usage of models for MNIST, CIFAR-10 and ImageNet can be compressed by 171×, 2731× and 128× with minimal accuracy loss, respectively. A configurable parallel hardware architecture is then proposed for processing the compressed FC layers efficiently. Without multipliers, a block matrix-vector multiplication module (B-MV) is used as the computing kernel. The architecture is flexible enough to support FC layers of various compression ratios with a small footprint. Simultaneously, memory accesses can be significantly reduced by the configurable architecture. Measurement results show that the accelerator has a processing power of 409.6 GOPS and achieves 5.3 TOPS/W energy efficiency at 800 MHz.
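
To make the two ideas concrete, here is a minimal Python sketch (illustrative only, not the authors' implementation; the function names, fixed-point input format, and exponent range are assumptions): a circulant block is fully described by its first column, so a k×k block multiply needs only O(k) storage, and a power-of-two weight turns every multiplication into a bit shift.

```python
import numpy as np

def circ_block_matvec(c, x):
    """y = C @ x for a circulant block C defined by its first column c.
    Storage is O(k) instead of O(k^2); the product is a circular convolution."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def quantize_pow2(w, min_exp=-7, max_exp=0):
    """Round each weight to the nearest signed integer power of two."""
    exp = np.clip(np.round(np.log2(np.abs(w) + 1e-12)), min_exp, max_exp)
    return np.sign(w).astype(int), exp.astype(int)

def shift_add_dot(x_fixed, sign, exp):
    """Dot product with power-of-two weights: shifts and adds, no multiplies."""
    acc = 0
    for xi, s, e in zip(x_fixed, sign, exp):
        acc += s * (xi << e if e >= 0 else xi >> -e)  # shift replaces multiply
    return acc
```

For a full FC layer, the weight matrix would be tiled into such k×k circulant blocks, each contributing one circular convolution to the output.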

Place, publisher, year, edition, pages
MDPI, 2019
Keywords
hardware acceleration, deep neural networks (DNNs), fully-connected layers, network compression, VLSI
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-244134 (URN), 10.3390/electronics8010078 (DOI), 000457142800078 (ISI), 2-s2.0-85060368656 (Scopus ID)
Note

QC 20190218

Available from: 2019-02-18. Created: 2019-02-18. Last updated: 2019-03-18. Bibliographically approved.
Chen, Q., Fu, Y., Song, W., Cheng, K., Lu, Z., Zhang, C. & Li, L. (2019). An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks. Electronics, 8(4), Article ID 371.
An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
2019 (English). In: Electronics, ISSN 2079-9292, Vol. 8, no. 4, article id 371. Article in journal (Refereed), Published
Abstract [en]

Convolutional neural networks (CNNs) have been widely applied in various fields, such as image recognition, speech processing, and many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse-grain task partitioning (CGTP) strategy, the proposed accelerator, with heterogeneous computing units supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which reduces power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a low-power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture well. The accelerator is implemented in TSMC 40 nm technology. Operating at 100.1 mW, it achieves high energy efficiency and area efficiency, which makes it a promising design for embedded devices.
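
The throughput argument behind coarse-grain task partitioning can be pictured with a toy balancer (a hypothetical sketch, not the paper's CGTP algorithm; costs and names are made up): when consecutive CNN layers are split into two pipeline stages running on separate computing units, steady-state throughput is limited by the slower stage, so the cut is chosen to minimize the maximum stage cost.

```python
def best_two_stage_cut(layer_costs):
    """Split consecutive layers into two pipeline stages; steady-state
    throughput is 1 / (cost of the slower stage), so minimize that cost."""
    best = None
    for cut in range(1, len(layer_costs)):
        stage_cost = max(sum(layer_costs[:cut]), sum(layer_costs[cut:]))
        if best is None or stage_cost < best[1]:
            best = (cut, stage_cost)
    return best

# Example: a 6-layer model; one unit runs layers [0, cut), the other the rest.
cut, cost = best_two_stage_cut([4, 8, 6, 3, 2, 1])
# cut == 2, cost == 12: both stages cost 12, vs. 24 sequentially,
# so the balanced pipeline roughly doubles throughput.
```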

Place, publisher, year, edition, pages
MDPI, 2019
Keywords
low bit-width convolutional neural networks, parallel streaming architecture, coarse grain task partitioning, reconfigurable, VLSI
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-252646 (URN), 10.3390/electronics8040371 (DOI), 000467751100002 (ISI), 2-s2.0-85064599789 (Scopus ID)
Note

QC 20190610

Available from: 2019-06-10. Created: 2019-06-10. Last updated: 2019-06-10. Bibliographically approved.
Zhang, W., Cao, Q. & Lu, Z. (2019). Bit-Flipping Schemes Upon MLC Flash: Investigation, Implementation, and Evaluation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(4), 780-784
Bit-Flipping Schemes Upon MLC Flash: Investigation, Implementation, and Evaluation
2019 (English). In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 38, no. 4, p. 780-784. Article in journal (Refereed), Published
Abstract [en]

Multilevel cell (MLC) states with lower threshold voltages endure less cell damage, lower retention error, and less current consumption. Based on these characteristics, it is promising to strengthen MLC flash by introducing bit-flipping that reshapes the state proportions on MLC pages. In this paper, we present a holistic study of bit-flipping schemes upon MLC flash in theory and practice. Specifically, we systematically investigate effective bit-flipping schemes and propose four new schemes for manipulating MLC states. We further design a generic implementation framework, named the MLC bit-flipping framework, to implement bit-flipping schemes within solid-state drive controllers, integrating nicely with existing system-level optimizations to further improve overall performance. The experimental results demonstrate that our proposed bit-flipping schemes standalone can reduce cell damage by up to 28% and retention errors by up to 53%. Our circuit-level simulation shows that the bit-flipping latency on a page is less than 4 μs when using 8K logic gates.
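
As a flavor of what such a scheme does, here is an illustrative single-bit-per-cell simplification (an assumption for the sketch; the paper's four schemes manipulate the four MLC states and are not reproduced here): if flipping a page would leave more cells in low-voltage, low-damage states, flip it and record one flag bit.

```python
def encode_page(bits):
    """Flip the whole page when 0-bits (assumed here to map to higher-voltage,
    more damaging states) are the majority; a flag bit records the choice."""
    flip = bits.count(0) > len(bits) // 2
    return ([b ^ 1 for b in bits] if flip else bits), flip

def decode_page(stored_bits, flip):
    """Undo the flip on read using the stored flag bit."""
    return [b ^ 1 for b in stored_bits] if flip else stored_bits
```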

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
Bit-flipping, lifetime extension, multilevel cell (MLC) flash, retention error reduction, state dependent damage
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-249871 (URN), 10.1109/TCAD.2018.2818693 (DOI), 000462370000016 (ISI), 2-s2.0-85044344391 (Scopus ID)
Note

QC 20190424

Available from: 2019-04-24. Created: 2019-04-24. Last updated: 2019-04-24. Bibliographically approved.
Ma, R., Wu, F., Zhang, M., Lu, Z., Wan, J. & Xie, C. (2019). RBER-Aware Lifetime Prediction Scheme for 3D-TLC NAND Flash Memory. IEEE Access, 7, 44696-44708
RBER-Aware Lifetime Prediction Scheme for 3D-TLC NAND Flash Memory
2019 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 7, p. 44696-44708. Article in journal (Refereed), Published
Abstract [en]

NAND flash memory is widely used in various computing systems. However, flash blocks can sustain only a limited number of program/erase (P/E) cycles, referred to as their endurance. On the one hand, to ensure data integrity, flash manufacturers often define the maximum P/E cycles of the worst block as the endurance of all flash blocks. On the other hand, blocks exhibit large endurance variations, which introduces two serious problems. The first is that the error-correcting code (ECC) is often over-provisioned, as it must be designed to tolerate the worst case to ensure data integrity, which causes longer decoding latency. The second is the underutilized lifespan of blocks due to the conservatively defined endurance: the raw bit error rate (RBER) of most blocks has not reached the allowable RBER at the nominal endurance point, which implies that conventional P/E-cycle-based block retirement policies may waste substantial flash storage capacity. In this paper, to exploit the storage capacity of each flash block, we propose an RBER-aware lifetime prediction scheme based on machine learning. Because a model can lose predictive effectiveness over time, we use incremental learning to update it and adapt to different lifetime stages; at run time, old training data is gradually discarded, which reduces memory overhead. To evaluate our proposal, four well-known machine learning techniques are compared in terms of predictive accuracy and time overhead under the proposed scheme. We also compared the predicted values with values measured on a real NAND flash-based test platform, and the results show that support vector machine (SVM) models under our scheme achieve up to 95% accuracy for flash blocks. Applying the scheme to predict the actual endurance of flash blocks at four different retention times improves the maximum P/E cycles of flash blocks by 37.5% to 86.3% on average. The proposed lifetime prediction scheme can therefore serve as a guide for block endurance prediction.
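
As a hedged illustration of the prediction step (the synthetic data, feature choice, and hyperparameters below are assumptions for the sketch; the paper's features and training protocol are not reproduced), an SVM regressor can map per-block measurements such as accumulated P/E cycles, current RBER, and retention time to an endurance estimate:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical per-block features: [P/E cycles so far, log10(RBER), retention hours]
X = np.column_stack([
    rng.integers(0, 3000, 500),
    rng.uniform(-5, -2, 500),
    rng.choice([1, 24, 168, 720], 500),
])
# Synthetic target: blocks with higher RBER get lower predicted endurance.
y = 3000 - 400 * (X[:, 1] + 5) + rng.normal(0, 50, 500)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
model.fit(X, y)
print(model.predict(X[:3]))  # predicted endurance (P/E cycles) per block
```

Incremental updating, as the abstract describes, would correspond to periodically refitting on a sliding window of recent measurements while older training samples are dropped.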

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
NAND flash, P/E cycle, retention time, RBER, machine learning
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-251293 (URN), 10.1109/ACCESS.2019.2909567 (DOI), 000465384700001 (ISI), 2-s2.0-85064569653 (Scopus ID)
Note

QC 20190510

Available from: 2019-05-10. Created: 2019-05-10. Last updated: 2019-06-11. Bibliographically approved.
Zhou, Y., Wu, F., Lu, Z., He, X., Huang, P. & Xie, C. (2019). SCORE: A Novel Scheme to Efficiently Cache Overlong ECCs in NAND Flash Memory. ACM Transactions on Architecture and Code Optimization (TACO), 15(4), Article ID 60.
SCORE: A Novel Scheme to Efficiently Cache Overlong ECCs in NAND Flash Memory
2019 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 15, no. 4, article id 60. Article in journal (Refereed), Published
Abstract [en]

Technology scaling and program/erase cycling result in an increasing bit error rate in NAND flash storage. Some solid-state drives (SSDs) adopt overlong error correction codes (ECCs), whose redundancy size exceeds the spare area limit of flash pages, to protect user data for improved reliability and lifetime. However, read performance is significantly degraded, because a logical data page and its ECC redundancy are then stored in two flash pages. In this article, we find that caching ECCs has large potential to reduce flash reads, since it achieves higher hit rates than caching data. We therefore propose SCORE, a novel scheme to efficiently cache overlong ECCs and improve SSD performance. The ECC redundancy that exceeds the spare area (called ECC residues) of logically consecutive data pages is grouped into ECC pages. SCORE partitions RAM to cache both data pages and ECC pages in a workload-adaptive manner. Finally, we verify SCORE using extensive trace-driven simulations. The results show that SCORE obtains high ECC hit rates without sacrificing data hit rates, improving read performance by an average of 22% under various workloads compared to state-of-the-art schemes.
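
The residue-grouping layout can be pictured with a small address calculation (a sketch under assumed sizes; SCORE's actual parameters and cache-partitioning policy are in the paper): with a 4 KiB flash page and, say, a 256 B residue per data page, sixteen logically consecutive data pages share one ECC page, so a single cached ECC page can serve reads of all sixteen.

```python
PAGE_SIZE = 4096      # flash page size in bytes (assumed)
RESIDUE_SIZE = 256    # ECC bytes exceeding the spare area, per data page (assumed)
RESIDUES_PER_ECC_PAGE = PAGE_SIZE // RESIDUE_SIZE  # 16

def ecc_location(logical_page):
    """Map a data page to (ECC page index, byte offset of its residue)."""
    return (logical_page // RESIDUES_PER_ECC_PAGE,
            (logical_page % RESIDUES_PER_ECC_PAGE) * RESIDUE_SIZE)

# Pages 0..15 all resolve to ECC page 0, so one cached page covers 16 data pages.
assert ecc_location(5) == (0, 1280) and ecc_location(16) == (1, 0)
```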

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
Solid state drive, overlong ECC, cache partitioning
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-244128 (URN), 10.1145/3291052 (DOI), 000457136000021 (ISI), 2-s2.0-85061187710 (Scopus ID)
Note

QC 20190218

Available from: 2019-02-18. Created: 2019-02-18. Last updated: 2019-02-18. Bibliographically approved.
Wang, J., Guo, S., Chen, Z., Li, Y. & Lu, Z. (2018). A New Parallel CODEC Technique for CDMA NoCs. IEEE Transactions on Industrial Electronics, 65(8), 6527-6537
A New Parallel CODEC Technique for CDMA NoCs
2018 (English). In: IEEE Transactions on Industrial Electronics, ISSN 0278-0046, E-ISSN 1557-9948, Vol. 65, no. 8, p. 6527-6537. Article in journal (Refereed), Published
Abstract [en]

Code division multiple access (CDMA) network-on-chip (NoC) has been proposed for many-core systems due to its data transfer parallelism over shared communication channels. Consequently, the coder-decoder (CODEC) module, which greatly impacts the performance of CDMA NoCs, has attracted growing attention in recent years. In this paper, we propose a new parallel CODEC technique for CDMA NoCs. By using a few simple logic circuits with small penalties in area and power, our new parallel CODEC (NPC) can execute the encoding/decoding process in parallel and thus reduce the data transfer latency. To reveal the benefits of our method for on-chip communication, we apply NPC to CDMA NoCs and perform extensive experiments. The results show that our method outperforms existing parallel CODECs, such as the Walsh-based parallel CODEC (WPC) and the overloaded parallel CODEC (OPC). Specifically, it improves the critical point of communication latency (by 7.3% over WPC and 13.5% over OPC), reduces packet latency jitter by about 17.3% (against WPC) and 71.6% (against OPC), and improves energy efficiency by up to 41.2% (against WPC) and 59.2% (against OPC).
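
For readers new to CDMA on-chip links, the sketch below shows the baseline mechanism that such CODECs accelerate (this is the textbook Walsh-code scheme, not the proposed NPC): mutually orthogonal spreading codes let several senders share one channel, and each receiver recovers its bit by correlating against its own code.

```python
import numpy as np

def walsh(n):
    """Walsh-Hadamard matrix of order n (a power of two); its rows are
    mutually orthogonal spreading codes."""
    h = np.array([[1]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

codes = walsh(8)
data = np.array([1, -1, 1, -1, 1, 1, -1, -1])  # one +/-1 bit per sender
channel = codes.T @ data                        # chips from all senders, summed
decoded = codes @ channel // codes.shape[0]     # correlate to recover each bit
assert (decoded == data).all()
```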

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Code division multiple access (CDMA), coder-decoder (CODEC), energy efficiency, network-on-chip (NoC), performance
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-226177 (URN), 10.1109/TIE.2017.2786230 (DOI), 000428902200050 (ISI), 2-s2.0-85039797002 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-16. Created: 2018-05-16. Last updated: 2018-10-19. Bibliographically approved.
Chen, X., Lei, Y., Lu, Z. & Chen, S. (2018). A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(10), 1953-1966
A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition
2018 (English). In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 26, no. 10, p. 1953-1966. Article in journal (Refereed), Published
Abstract [en]

Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes required by different applications vary widely. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE 754 single-precision floating-point standard and FFT calculation over a wide size range from 2 to 2²⁰ points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large-size FFT into several small-size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with an area of 2.4 mm² and a power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves up to 18.89× speedups in comparison to two software-only solutions and two hardware-dedicated solutions.
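
The decomposition behind the accelerator can be demonstrated in a few lines (a NumPy sketch of the classic four-step/transpose formulation of Cooley-Tukey, not the paper's hardware dataflow): an N = n1·n2 point FFT becomes n2 column FFTs of size n1, a twiddle-factor scaling, n1 row FFTs of size n2, and a final transpose.

```python
import numpy as np

def four_step_fft(x, n1, n2):
    """FFT of length n1*n2 via matrix transposition (four-step Cooley-Tukey)."""
    a = x.reshape(n1, n2)
    a = np.fft.fft(a, axis=0)                # n2 small FFTs of size n1 (columns)
    k1 = np.arange(n1)[:, None]
    b = np.arange(n2)[None, :]
    a = a * np.exp(-2j * np.pi * k1 * b / (n1 * n2))  # twiddle factors
    a = np.fft.fft(a, axis=1)                # n1 small FFTs of size n2 (rows)
    return a.T.reshape(-1)                   # transpose restores natural order

x = np.random.default_rng(1).standard_normal(1024)
assert np.allclose(four_step_fft(x, 32, 32), np.fft.fft(x))
```

Because the small FFTs along each axis are independent, they can run in parallel, which is exactly what makes this formulation attractive for hardware.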

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
Fast Fourier transform (FFT), hardware accelerator, matrix transposition (MT), token-based task scheduling
National Category
Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-237108 (URN), 10.1109/TVLSI.2018.2846688 (DOI), 000446332500012 (ISI), 2-s2.0-85049432682 (Scopus ID)
Note

QC 20181023

Available from: 2018-10-23. Created: 2018-10-23. Last updated: 2018-10-23. Bibliographically approved.
Wang, Z., Chen, X., Lu, Z. & Guo, Y. (2018). Cache Access Fairness in 3D Mesh-Based NUCA. IEEE Access, 6, 42984-42996
Cache Access Fairness in 3D Mesh-Based NUCA
2018 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 6, p. 42984-42996. Article in journal (Refereed), Published
Abstract [en]

Given the increase in cache capacity over the past few decades, cache access efficiency has come to play a critical role in determining system performance. To ensure efficient utilization of cache resources, non-uniform cache architecture (NUCA) has been proposed to allow for a large capacity and a short access latency. With the support of networks-on-chip (NoC), NUCA is often employed to organize the last-level cache. However, this method also hurts cache access fairness, which denotes the degree of non-uniformity of cache access latencies. This drop in fairness can result in an increased number of cache accesses with excessively high latency, which becomes a bottleneck for system performance. This paper investigates cache access fairness in the context of NoC-based 3D chip architecture and provides new insights into 3D architecture design. We propose fair-NUCA (F-NUCA), a co-design scheme intended to optimize cache access fairness. In F-NUCA, we strive to improve fairness by equalizing cache access latencies. To achieve this goal, the memory mapping and the channel width are both redistributed non-uniformly, thereby equalizing the non-contention and contention latencies, respectively. The experimental results reveal that F-NUCA can effectively improve cache access fairness. When F-NUCA is compared with traditional static NUCA in simulations with PARSEC benchmarks, the average reductions in average latency and latency standard deviation are 4.64%/9.38% for a 4×4×2 mesh network and 6.31%/13.51% for a 4×4×4 mesh network. In addition, a 4.0%/6.4% improvement in system throughput is achieved for the two mesh network scales, respectively.
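
The fairness problem itself is easy to quantify (a small sketch under assumed unit link latency; this computes the metric F-NUCA targets, not its remapping algorithm): in a static NUCA, a core's non-contention latency to a bank grows with the Manhattan hop distance in the mesh, so latencies to different banks spread widely.

```python
from itertools import product
from statistics import mean, pstdev

def hops(core, bank):
    """Manhattan hop count between two nodes of a 3D mesh."""
    return sum(abs(c - b) for c, b in zip(core, bank))

banks = list(product(range(4), range(4), range(2)))   # a 4x4x2 mesh
core = (0, 0, 0)                                      # a corner core
lat = [hops(core, b) for b in banks]                  # non-contention latencies
print(mean(lat), pstdev(lat))  # the spread (std dev) is what F-NUCA reduces
```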

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Keywords
3D chip architecture, cache memory, memory architecture, memory mapping, multiprocessor interconnection networks, networks-on-chip, non-uniform cache architecture
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-240191 (URN), 10.1109/ACCESS.2018.2862633 (DOI), 000443905300001 (ISI), 2-s2.0-85050975554 (Scopus ID)
Note

QC 20181219

Available from: 2018-12-19. Created: 2018-12-19. Last updated: 2018-12-19. Bibliographically approved.
Wu, F., Zhu, Y., Xiong, Q., Lu, Z., Zhou, Y., Kong, W. & Xie, C. (2018). Characterizing 3D Charge Trap NAND Flash: Observations, Analyses and Applications. In: Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018: . Paper presented at 36th International Conference on Computer Design, ICCD 2018; Holiday Inn Orlando - Disney Springs Area Orlando; United States; 7 October 2018 through 10 October 2018 (pp. 381-388). Institute of Electrical and Electronics Engineers (IEEE), Article ID 8615714.
Characterizing 3D Charge Trap NAND Flash: Observations, Analyses and Applications
2018 (English). In: Proceedings - 2018 IEEE 36th International Conference on Computer Design, ICCD 2018, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 381-388, article id 8615714. Conference paper, Published paper (Refereed)
Abstract [en]

In the 3D era, Charge Trap (CT) NAND flash is employed in mainstream products, so a deep understanding of its characteristics is becoming increasingly crucial for designing flash-based systems. In this paper, to enable such understanding, we conduct comprehensive experiments on advanced 3D CT NAND flash chips using an ARM- and FPGA-based evaluation platform we developed. Based on the experimental results, we first make distinct observations on the characteristics of 3D CT NAND flash, including its performance and reliability features. We then analyze the observations from physical and circuit perspectives. Finally, based on the unique characteristics of 3D CT NAND flash, we present suggestions for optimizing flash management algorithms in real applications.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
Series
Proceedings IEEE International Conference on Computer Design, ISSN 1063-6404
Keywords
3D CT NAND flash, performance, reliability
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-245098 (URN), 10.1109/ICCD.2018.00064 (DOI), 000458293200053 (ISI), 2-s2.0-85061198581 (Scopus ID), 978-1-5386-8477-1 (ISBN)
Conference
36th International Conference on Computer Design, ICCD 2018; Holiday Inn Orlando - Disney Springs Area Orlando; United States; 7 October 2018 through 10 October 2018
Note

QC 20190308

Available from: 2019-03-08. Created: 2019-03-08. Last updated: 2019-03-08. Bibliographically approved.
Xiong, Q., Wu, F., Lu, Z., Zhu, Y., Zhou, Y., Chu, Y., . . . Huang, P. (2018). Characterizing 3D Floating Gate NAND Flash: Observations, Analyses, and Implications. ACM Transactions on Storage, 14(2), Article ID 16.
Characterizing 3D Floating Gate NAND Flash: Observations, Analyses, and Implications
2018 (English). In: ACM Transactions on Storage, ISSN 1553-3077, E-ISSN 1553-3093, Vol. 14, no. 2, article id 16. Article in journal (Refereed), Published
Abstract [en]

As both NAND flash memory manufacturers and users turn their attention from planar architecture towards three-dimensional (3D) architecture, it becomes critical and urgent to understand the characteristics of 3D NAND flash memory. These characteristics, especially those that differ from planar NAND flash, can significantly affect design choices for flash management techniques. In this article, we present a characterization study of state-of-the-art 3D floating gate (FG) NAND flash memory through comprehensive experiments on an FPGA-based 3D NAND flash evaluation platform. We make distinct observations on its performance and reliability, such as operation latencies and various error patterns, followed by careful analyses from physical and circuit-level perspectives. Although 3D FG NAND flash provides much higher storage density than planar NAND flash, it faces new performance challenges of garbage collection overhead and program performance variation, and more complicated reliability issues due to, e.g., the distinct location dependence and value dependence of errors. We also summarize the differences between 3D FG NAND flash and planar NAND flash and discuss the implications of this architectural innovation for the design of NAND flash management techniques. We believe that our work will facilitate the development of novel 3D FG NAND flash-oriented designs that achieve better performance and reliability.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
Keywords
3D floating gate NAND flash, MLC, error pattern
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-232795 (URN), 10.1145/3162616 (DOI), 000434635800005 (ISI)
Note

QC 20180802

Available from: 2018-08-02. Created: 2018-08-02. Last updated: 2018-08-02. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-0061-3475
