Publications (10 of 198)
Qin, Z., Zhu, D., Zhu, X., Chen, X., Shi, Y., Gao, Y., . . . Pan, H. (2019). Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights. Electronics, 8(1), Article ID 78.
Accelerating Deep Neural Networks by Combining Block-Circulant Matrices and Low-Precision Weights
2019 (English). In: Electronics, ISSN 2079-9292, Vol. 8, no. 1, article id 78. Article in journal (Refereed). Published.
Abstract [en]

As a key ingredient of deep neural networks (DNNs), fully-connected (FC) layers are widely used in various artificial intelligence applications. However, FC layers contain many parameters, so their efficient processing is restricted by memory bandwidth. In this paper, we propose a compression approach combining block-circulant matrix-based weight representation and power-of-two quantization. Applying block-circulant matrices in FC layers reduces the storage complexity from O(k²) to O(k). By quantizing the weights into integer powers of two, the multiplications in the inference can be replaced by shift and add operations. The memory usage of models for MNIST, CIFAR-10 and ImageNet can be compressed by 171x, 2731x and 128x, respectively, with minimal accuracy loss. A configurable parallel hardware architecture is then proposed for processing the compressed FC layers efficiently. Without multipliers, a block matrix-vector multiplication module (B-MV) is used as the computing kernel. The architecture is flexible enough to support FC layers of various compression ratios with a small footprint. At the same time, the configurable architecture significantly reduces memory accesses. Measurement results show that the accelerator has a processing power of 409.6 GOPS and achieves 5.3 TOPS/W energy efficiency at 800 MHz.
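
As a rough illustration of the two compression ideas, the sketch below (a minimal Python illustration under assumed shapes, not the authors' implementation; all function names are hypothetical) evaluates an FC layer as block-circulant matrix-vector products via the FFT and snaps weights to signed powers of two:

```python
import numpy as np

def circulant_matvec(c, x):
    # Multiply by the k x k circulant matrix whose first column is c, using
    # the convolution theorem: O(k log k) time and O(k) storage per block.
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def quantize_pow2(w, eps=1e-12):
    # Snap each weight to the nearest signed integer power of two (rounding
    # in the log domain), so multiplications become shifts and adds.
    return np.sign(w) * 2.0 ** np.round(np.log2(np.abs(w) + eps))

def block_circulant_fc(x, first_cols):
    # first_cols has shape (p, q, k): one defining vector per k x k block,
    # replacing a (p*k) x (q*k) dense weight matrix.
    p, q, k = first_cols.shape
    xb = x.reshape(q, k)
    y = np.zeros((p, k))
    for i in range(p):
        for j in range(q):
            y[i] += circulant_matvec(quantize_pow2(first_cols[i, j]), xb[j])
    return y.reshape(p * k)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8, 16))                 # 64x128 layer, compressed form
y = block_circulant_fc(rng.normal(size=128), w)
```

Each k×k block is stored as a single length-k vector, which is where the O(k²)-to-O(k) reduction comes from; the paper's multiplier-free B-MV kernel would instead apply shift-and-add directly to the power-of-two weights.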

Place, publisher, year, edition, pages
MDPI, 2019
Keywords
hardware acceleration, deep neural networks (DNNs), fully-connected layers, network compression, VLSI
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-244134 (URN)
10.3390/electronics8010078 (DOI)
000457142800078 (ISI)
2-s2.0-85060368656 (Scopus ID)
Note

QC 20190218

Available from: 2019-02-18. Created: 2019-02-18. Last updated: 2019-03-18. Bibliographically approved.
Chen, Q., Fu, Y., Song, W., Cheng, K., Lu, Z., Zhang, C. & Li, L. (2019). An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks. Electronics, 8(4), Article ID 371.
An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks
2019 (English). In: Electronics, ISSN 2079-9292, Vol. 8, no. 4, article id 371. Article in journal (Refereed). Published.
Abstract [en]

Convolutional Neural Networks (CNNs) have been widely applied in various fields, such as image recognition and speech processing, as well as in many big-data analysis tasks. However, their large size and intensive computation hinder their deployment in hardware, especially on embedded systems with stringent latency, power, and area requirements. To address this issue, low bit-width CNNs have been proposed as a highly competitive candidate. In this paper, we propose an efficient, scalable accelerator for low bit-width CNNs based on a parallel streaming architecture. With a novel coarse-grain task partitioning (CGTP) strategy, the proposed accelerator with heterogeneous computing units, supporting multi-pattern dataflows, can nearly double the throughput for various CNN models on average. Besides, a hardware-friendly algorithm is proposed to simplify the activation and quantification process, which reduces power dissipation and area overhead. Based on the optimized algorithm, an efficient reconfigurable three-stage activation-quantification-pooling (AQP) unit with a low-power staged blocking strategy is developed, which can process activation, quantification, and max-pooling operations simultaneously. Moreover, an interleaving memory scheduling scheme is proposed to support the streaming architecture well. The accelerator is implemented in TSMC 40 nm technology with a core size of . It can achieve TOPS/W energy efficiency and area efficiency at 100.1 mW, which makes it a promising design for embedded devices.
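
The fused three-stage AQP idea can be sketched in a few lines. The following is a minimal illustration with an assumed bit width, pooling window, and ReLU activation; it is not the paper's hardware design:

```python
import numpy as np

def aqp(fmap, bits=2, pool=2):
    # Fused activation -> quantification -> max-pooling over one feature map.
    act = np.maximum(fmap, 0.0)                  # stage 1: ReLU activation
    levels = (1 << bits) - 1
    scale = act.max() if act.max() > 0 else 1.0
    q = np.round(act / scale * levels)           # stage 2: low bit-width codes
    h, w = q.shape
    q = q[: h - h % pool, : w - w % pool]        # crop to the pooling grid
    q = q.reshape(h // pool, pool, w // pool, pool)
    return q.max(axis=(1, 3)).astype(np.uint8)   # stage 3: max-pooling

out = aqp(np.random.default_rng(0).normal(size=(8, 8)))
```

Because uniform quantization is monotonic, max-pooling the quantized codes equals quantizing the pooled maxima, which is what makes processing the three stages simultaneously safe.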

Place, publisher, year, edition, pages
MDPI, 2019
Keywords
low bit-width convolutional neural networks, parallel streaming architecture, coarse grain task partitioning, reconfigurable, VLSI
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-252646 (URN)
10.3390/electronics8040371 (DOI)
000467751100002 (ISI)
2-s2.0-85064599789 (Scopus ID)
Note

QC 20190610

Available from: 2019-06-10. Created: 2019-06-10. Last updated: 2019-06-10. Bibliographically approved.
Wang, B., Lu, Z. & Chen, S. (2019). ANN Based Admission Control for On-Chip Networks. In: Proceedings of the 2019 56th ACM/EDAC/IEEE Design Automation Conference (DAC). Paper presented at the 56th Annual Design Automation Conference, DAC 2019, Las Vegas, 2-6 June 2019. Association for Computing Machinery.
ANN Based Admission Control for On-Chip Networks
2019 (English). In: Proceedings of the 2019 56th ACM/EDAC/IEEE Design Automation Conference (DAC), Association for Computing Machinery, 2019. Conference paper, Published paper (Refereed).
Abstract [en]

We propose an admission control method for Networks-on-Chip (NoCs) with a centralized Artificial Neural Network (ANN) admission controller, which improves system performance by predicting the most appropriate injection rate of each node from network performance information. In the online control process, a data preprocessing unit is applied to simplify the ANN architecture and make the prediction results more accurate. Based on the preprocessed information, the ANN predictor determines the control strategy and broadcasts it to each node, where the admission control is applied. Compared with previous work, our method builds a high-fidelity model between the network status and the injection rate regulation. Full-system simulation results show that our proposed method enhances application performance by 17.8% on average and by up to 23.8%.
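
A minimal sketch of such a centralized predictor, assuming a tiny two-layer network and illustrative feature and node counts (none of these sizes come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

class AdmissionANN:
    """Tiny MLP mapping preprocessed network status to per-node injection-rate
    caps. Layer sizes and features are illustrative, not from the paper."""
    def __init__(self, n_features, n_nodes, hidden=16):
        self.w1 = rng.normal(0.0, 0.1, (n_features, hidden))
        self.w2 = rng.normal(0.0, 0.1, (hidden, n_nodes))

    def predict(self, status):
        # status: normalized metrics such as buffer occupancy and latency
        h = np.tanh(status @ self.w1)
        # The sigmoid keeps every predicted injection rate in (0, 1); the
        # centralized controller would broadcast these caps to all nodes.
        return 1.0 / (1.0 + np.exp(-(h @ self.w2)))

ctrl = AdmissionANN(n_features=8, n_nodes=16)
caps = ctrl.predict(rng.random(8))      # one injection-rate cap per node
```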

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
National Category
Telecommunications
Identifiers
urn:nbn:se:kth:diva-259467 (URN)
10.1145/3316781.3317772 (DOI)
000482058200046 (ISI)
2-s2.0-85067806732 (Scopus ID)
Conference
56th Annual Design Automation Conference, DAC 2019, Las Vegas, 2-6 June 2019
Note

QC 20190920

Available from: 2019-09-20. Created: 2019-09-20. Last updated: 2019-09-20. Bibliographically approved.
Zhang, W., Cao, Q. & Lu, Z. (2019). Bit-Flipping Schemes Upon MLC Flash: Investigation, Implementation, and Evaluation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 38(4), 780-784
Bit-Flipping Schemes Upon MLC Flash: Investigation, Implementation, and Evaluation
2019 (English). In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 38, no. 4, p. 780-784. Article in journal (Refereed). Published.
Abstract [en]

Multilevel cell (MLC) states with lower threshold voltage endure less cell damage, lower retention error, and less current consumption. Based on these characteristics, there is an opportunity to strengthen MLC flash by introducing bit-flipping that reshapes the state proportions on MLC pages. In this paper, we present a holistic study of bit-flipping schemes upon MLC flash in theory and practice. Specifically, we systematically investigate effective bit-flipping schemes and propose four new schemes for manipulating MLC states. We further design a generic implementation framework, named the MLC bit-flipping framework, to implement bit-flipping schemes within solid-state drive controllers, integrating nicely with existing system-level optimizations to further improve overall performance. The experimental results demonstrate that our proposed bit-flipping schemes alone can reduce cell damage by up to 28% and retention errors by up to 53%. Our circuit-level simulation shows that the bit-flipping latency on a page is less than 4 μs when using 8K logic gates.
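
A minimal sketch of the underlying idea, with an assumed per-state wear cost and the simplest possible flip rule (the paper's four schemes are more refined):

```python
import numpy as np

# Assumed per-state wear cost: lower threshold-voltage states damage the
# cell less. The 2-bit patterns and cost values here are illustrative only.
STATE_COST = {0b11: 0, 0b10: 1, 0b00: 2, 0b01: 3}

def page_cost(cells):
    return sum(STATE_COST[c] for c in cells)

def flip_if_beneficial(cells):
    # Simplest scheme: invert every cell's two bits when that shifts the page
    # toward low-voltage states; a single flag bit records the decision so
    # the controller can undo the flip on reads.
    flipped = [c ^ 0b11 for c in cells]
    if page_cost(flipped) < page_cost(cells):
        return flipped, 1          # store inverted data, flag = 1
    return list(cells), 0          # store as-is, flag = 0

cells = np.random.default_rng(1).integers(0, 4, size=4096).tolist()
stored, flag = flip_if_beneficial(cells)
```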

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
Bit-flipping, lifetime extension, multilevel cell (MLC) flash, retention error reduction, state dependent damage
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-249871 (URN)
10.1109/TCAD.2018.2818693 (DOI)
000462370000016 (ISI)
2-s2.0-85044344391 (Scopus ID)
Note

QC 20190424

Available from: 2019-04-24. Created: 2019-04-24. Last updated: 2019-04-24. Bibliographically approved.
Fu, Y., Chen, Q., He, G., Chen, K., Lu, Z., Zhang, C. & Li, L. (2019). Congestion-Aware Dynamic Elevator Assignment for Partially Connected 3D-NoCs. In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS). Paper presented at the IEEE International Symposium on Circuits and Systems (IEEE ISCAS), May 26-29, 2019, Sapporo, Japan. IEEE.
Congestion-Aware Dynamic Elevator Assignment for Partially Connected 3D-NoCs
Show others...
2019 (English). In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2019. Conference paper, Published paper (Refereed).
Abstract [en]

The combination of Networks-on-Chip (NoCs) and 3D IC technology, 3D NoCs, has been proven to achieve great improvements in both network performance and power consumption compared to 2D NoCs. In traditional 3D NoCs, all routers are vertically connected. Due to the large overhead of Through-Silicon Vias (TSVs), e.g., low fabrication yield and occupied silicon area, partially connected 3D NoCs have emerged. The assignment method determines the traffic loads of the vertical links (elevators) and thus has a great impact on the performance of 3D NoCs. In this paper, we propose a congestion-aware dynamic elevator assignment (CDA) scheme, which takes both distance factors and network congestion information into account. Experiments show that the performance of the proposed CDA scheme improves on the random selection scheme by 67% to 87%, on SelByDis-1 by 8% to 25%, and on SelByDis-2 by 13% to 18%.
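
A minimal sketch of the selection rule, with an assumed linear trade-off weight between distance and congestion (the paper's exact cost model is not reproduced here):

```python
def select_elevator(src, elevators, congestion, alpha=0.5):
    # Choose the vertical link for a packet at router `src` (x, y) by a
    # weighted sum of Manhattan distance and a per-elevator congestion
    # score; `alpha` is an assumed trade-off weight, not the paper's value.
    def cost(e):
        dist = abs(src[0] - e[0]) + abs(src[1] - e[1])
        return alpha * dist + (1.0 - alpha) * congestion[e]
    return min(elevators, key=cost)

elevators = [(0, 0), (3, 1), (1, 3)]
congestion = {(0, 0): 0.9, (3, 1): 0.2, (1, 3): 0.4}   # e.g. buffer occupancy
print(select_elevator((1, 1), elevators, congestion))   # -> (3, 1)
```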

Place, publisher, year, edition, pages
IEEE, 2019
Series
IEEE International Symposium on Circuits and Systems, ISSN 0271-4302
Keywords
3D NoC, TSV, assignment, congestion
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-260224 (URN)
10.1109/ISCAS.2019.8702434 (DOI)
000483076401111 (ISI)
2-s2.0-85066814567 (Scopus ID)
978-1-7281-0397-6 (ISBN)
Conference
IEEE International Symposium on Circuits and Systems (IEEE ISCAS), May 26-29, 2019, Sapporo, Japan
Note

QC 20190930

Available from: 2019-09-30. Created: 2019-09-30. Last updated: 2019-09-30. Bibliographically approved.
Ma, R., Wu, F., Zhang, M., Lu, Z., Wan, J. & Xie, C. (2019). RBER-Aware Lifetime Prediction Scheme for 3D-TLC NAND Flash Memory. IEEE Access, 7, 44696-44708
RBER-Aware Lifetime Prediction Scheme for 3D-TLC NAND Flash Memory
2019 (English). In: IEEE Access, E-ISSN 2169-3536, Vol. 7, p. 44696-44708. Article in journal (Refereed). Published.
Abstract [en]

NAND flash memory is widely used in various computing systems. However, flash blocks can sustain only a limited number of program/erase (P/E) cycles, referred to as their endurance. On one hand, in order to ensure data integrity, flash manufacturers often define the maximum P/E cycles of the worst block as the endurance of all flash blocks. On the other hand, blocks exhibit large endurance variations, which introduce two serious problems. The first problem is that the error correcting code (ECC) is often over-provisioned, as it has to be designed to tolerate the worst case to ensure data integrity, which causes longer decoding latency. The second problem is underutilization of each block's lifespan due to the conservatively defined block endurance. The raw bit error rate (RBER) of most blocks has not yet reached the allowable RBER at the nominal endurance point, which implies that conventional P/E cycle-based block retirement policies may waste a large amount of flash storage space. In this paper, to exploit the storage capacity of each flash block, we propose an RBER-aware lifetime prediction scheme based on machine learning technologies. Since the model can lose prediction effectiveness over time, we use incremental learning to update it and adapt to changes at different lifetime stages. At run time, training data are gradually discarded, which reduces memory overhead. To evaluate our approach, four well-known machine learning techniques were compared in terms of predictive accuracy and time overhead under the proposed lifetime prediction scheme. We also compared the predicted values with measured values obtained on a real NAND flash-based test platform, and the experimental results show that support vector machine (SVM) models based on our proposed lifetime prediction scheme can achieve up to 95% accuracy for flash blocks. We also apply the proposed lifetime prediction scheme to predict the actual endurance of flash blocks at four different retention times, and the experimental results show that it can significantly improve the maximum P/E cycles of flash blocks, by 37.5% to 86.3% on average. Therefore, the proposed lifetime prediction scheme can provide a guide for block endurance prediction.
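
A minimal sketch of such a predictor using scikit-learn, with synthetic stand-in data and a sliding-window refit approximating the incremental update (the feature set, window size, and hyperparameters are assumptions):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data: RBER grows with P/E cycles and retention time.
pe = rng.uniform(0, 10_000, 500)                   # program/erase cycles
ret = rng.uniform(0, 12, 500)                      # retention time, months
rber = 1e-6 * np.exp(pe / 4000 + ret / 10) * rng.lognormal(0, 0.1, 500)

X, y = np.column_stack([pe, ret]), np.log(rber)    # regress in log space
model = SVR(kernel="rbf", C=10.0).fit(X, y)

# "Incremental" update: refit on the most recent window and drop older
# samples, approximating the paper's gradual discarding of trained data
# (SVR itself has no partial_fit, so a sliding window stands in for it).
recent = slice(-300, None)
model.fit(X[recent], y[recent])

pred_rber = np.exp(model.predict([[6000.0, 6.0]]))[0]   # query one block state
```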

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
NAND flash, P/E cycle, retention time, RBER, machine learning
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-251293 (URN)
10.1109/ACCESS.2019.2909567 (DOI)
000465384700001 (ISI)
2-s2.0-85064569653 (Scopus ID)
Note

QC 20190510

Available from: 2019-05-10. Created: 2019-05-10. Last updated: 2019-06-11. Bibliographically approved.
Zhou, Y., Wu, F., Lu, Z., He, X., Huang, P. & Xie, C. (2019). SCORE: A Novel Scheme to Efficiently Cache Overlong ECCs in NAND Flash Memory. ACM Transactions on Architecture and Code Optimization (TACO), 15(4), Article ID 60.
SCORE: A Novel Scheme to Efficiently Cache Overlong ECCs in NAND Flash Memory
2019 (English). In: ACM Transactions on Architecture and Code Optimization (TACO), ISSN 1544-3566, E-ISSN 1544-3973, Vol. 15, no. 4, article id 60. Article in journal (Refereed). Published.
Abstract [en]

Technology scaling and program/erase cycling result in an increasing bit error rate in NAND flash storage. Some solid state drives (SSDs) adopt overlong error correction codes (ECCs), whose redundancy size exceeds the spare-area limit of flash pages, to protect user data for improved reliability and lifetime. However, read performance is significantly degraded, because a logical data page and its ECC redundancy are stored in two flash pages. In this article, we find that caching ECCs has a large potential to reduce flash reads by achieving higher hit rates, compared to caching data. We then propose a novel scheme to efficiently cache overlong ECCs, called SCORE, to improve SSD performance. The exceeding ECC redundancy (called ECC residues) of logically consecutive data pages is grouped into ECC pages. SCORE partitions RAM to cache both data pages and ECC pages in a workload-adaptive manner. Finally, we verify SCORE using extensive trace-driven simulations. The results show that SCORE obtains high ECC hit rates without sacrificing data hit rates, improving read performance by an average of 22% under various workloads compared to state-of-the-art schemes.
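
A minimal sketch of the cache organization, with an assumed grouping factor and a fixed RAM split (SCORE adapts the split to the workload):

```python
from collections import OrderedDict

class LRU:
    def __init__(self, capacity):
        self.cap, self.d = capacity, OrderedDict()
    def get(self, key):
        if key in self.d:
            self.d.move_to_end(key)      # refresh recency on a hit
            return True
        return False
    def put(self, key):
        self.d[key] = None
        self.d.move_to_end(key)
        if len(self.d) > self.cap:
            self.d.popitem(last=False)   # evict least recently used

ECC_RESIDUES_PER_PAGE = 8                    # assumed grouping factor
data_cache, ecc_cache = LRU(600), LRU(200)   # fixed split; SCORE adapts this

def read(lpn):
    # A data-cache hit avoids both flash reads; an ECC-page hit still avoids
    # the extra read for the overlong residue, the case SCORE exploits.
    data_hit = data_cache.get(lpn)
    ecc_hit = ecc_cache.get(lpn // ECC_RESIDUES_PER_PAGE)
    if not data_hit:
        data_cache.put(lpn)
    if not ecc_hit:
        ecc_cache.put(lpn // ECC_RESIDUES_PER_PAGE)
    return data_hit, ecc_hit
```

Because one ECC page covers the residues of several consecutive data pages, the ECC cache can reach a higher hit rate per byte of RAM, which is the observation the partitioning exploits.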

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
Solid state drive, overlong ECC, cache partitioning
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-244128 (URN)
10.1145/3291052 (DOI)
000457136000021 (ISI)
2-s2.0-85061187710 (Scopus ID)
Note

QC 20190218

Available from: 2019-02-18. Created: 2019-02-18. Last updated: 2019-02-18. Bibliographically approved.
Guo, S., Wang, J., Chen, Z., Lu, Z., Guo, J. & Yang, L. (2019). Security-Aware Task Mapping Reducing Thermal Side Channel Leakage in CMPs. IEEE Transactions on Industrial Informatics, 15(10), 5435-5443
Security-Aware Task Mapping Reducing Thermal Side Channel Leakage in CMPs
2019 (English). In: IEEE Transactions on Industrial Informatics, ISSN 1551-3203, E-ISSN 1941-0050, Vol. 15, no. 10, p. 5435-5443. Article in journal (Refereed). Published.
Abstract [en]

Chip multiprocessors (CMPs) have suffered from growing threats to hardware security in recent years, such as side channel attacks, hardware Trojan infection, and chip cloning. In this paper, we propose a security-aware (SA) task mapping method to reduce the information leakage from the CMP thermal side channel. First, we construct a mathematical function that estimates the CMP security cost corresponding to a given mapping result. Then, we develop a greedy mapping algorithm that automatically allocates all threads of an application to a set of proper cores, such that the total security cost is optimized. Finally, we perform extensive experiments to evaluate our method. The experimental results show that our SA mapping effectively decreases CMP side channel leakage. Compared to two existing task mapping methods, the standard Linux scheduler (LS) and NoC-Sprinting (NS, a thermal-aware mapping technique), our method reduces the side-channel vulnerability factor by up to 19% and 7%, respectively. Moreover, our method also gains higher computational efficiency, improving million instructions per second by up to 100% against NS and up to 33% against LS.
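
A minimal sketch of such a greedy allocation, with a made-up pairwise cost that decays with core distance (a stand-in for the paper's security cost function):

```python
import itertools

def greedy_map(threads, cores, pair_cost):
    # Place threads one by one on the free core that adds the least security
    # cost relative to already-placed threads; `pair_cost` is a stand-in for
    # the paper's thermal-side-channel cost model.
    placed, free = {}, set(cores)
    for t in threads:
        best = min(free, key=lambda c: sum(pair_cost(c, d) for d in placed.values()))
        placed[t] = best
        free.remove(best)
    return placed

# Toy 4x4 mesh with a cost that decays with distance, so threads spread out,
# flattening the thermal signature visible to a side-channel observer.
cores = list(itertools.product(range(4), range(4)))
cost = lambda c, d: 1.0 / (1 + abs(c[0] - d[0]) + abs(c[1] - d[1]))
print(greedy_map(["t0", "t1", "t2", "t3"], cores, cost))
```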

Place, publisher, year, edition, pages
IEEE, 2019
Keywords
Task analysis, Hardware, Informatics, Side-channel attacks, Temperature measurement, Instruction sets, Chip-multi processor (CMP), hardware security, task mapping, thermal side channel
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-263683 (URN)
10.1109/TII.2019.2904092 (DOI)
000492292500005 (ISI)
2-s2.0-85073594090 (Scopus ID)
Note

QC 20191108

Available from: 2019-11-08. Created: 2019-11-08. Last updated: 2019-11-08. Bibliographically approved.
Chen, Q., Fu, Y., Cheng, K., Song, W., Lu, Z., Li, L. & Zhang, C. (2019). Smilodon: An Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning. In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS). Paper presented at the IEEE International Symposium on Circuits and Systems (IEEE ISCAS), May 26-29, 2019, Sapporo, Japan. IEEE.
Smilodon: An Efficient Accelerator for Low Bit-Width CNNs with Task Partitioning
2019 (English). In: 2019 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2019. Conference paper, Published paper (Refereed).
Abstract [en]

Convolutional Neural Networks (CNNs) have been widely applied in various fields such as image and video recognition, recommender systems, and natural language processing. However, their massive size and intensive computation load prevent feasible deployment in practice, especially on embedded systems. As a highly competitive candidate, low bit-width CNNs have been proposed to enable efficient implementation. In this paper, we propose Smilodon, a scalable, efficient accelerator for low bit-width CNNs based on a parallel streaming architecture, optimized with a task partitioning strategy. We also present 3D systolic-like computing arrays fitting for convolutional layers. Our design is implemented on a Zynq XC7Z020 FPGA, where it satisfies real-time needs with a throughput of 1,622 FPS while consuming 2.1 W. To the best of our knowledge, our accelerator is superior to the state-of-the-art works in the trade-off among throughput, power efficiency, and area efficiency.
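
A minimal sketch of the output-stationary flavor of such an array, written as plain loops over weight positions (the dimensions and dataflow here are assumptions, not the paper's design):

```python
import numpy as np

def conv_systolic_like(ifmap, weights):
    # Output-stationary schedule: each output pixel (one PE per pixel)
    # accumulates partial sums as weight values stream past, the scheduling
    # style a systolic-like array maps onto hardware.
    # ifmap: (C, H, W), weights: (K, C, R, S) -> ofmap: (K, H-R+1, W-S+1)
    C, H, W = ifmap.shape
    K, _, R, S = weights.shape
    ofmap = np.zeros((K, H - R + 1, W - S + 1))
    for r in range(R):
        for s in range(S):
            for c in range(C):
                ofmap += (weights[:, c, r, s][:, None, None]
                          * ifmap[c, r:r + H - R + 1, s:s + W - S + 1][None])
    return ofmap

rng = np.random.default_rng(0)
out = conv_systolic_like(rng.normal(size=(3, 8, 8)),
                         rng.normal(size=(4, 3, 3, 3)))   # -> (4, 6, 6)
```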

Place, publisher, year, edition, pages
IEEE, 2019
Series
IEEE International Symposium on Circuits and Systems, ISSN 0271-4302
Keywords
Low bit-width CNNs, 3D systolic-like array, task partitioning, parallel streaming architecture
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-260226 (URN)
10.1109/ISCAS.2019.8702547 (DOI)
000483076402015 (ISI)
2-s2.0-85066787463 (Scopus ID)
978-1-7281-0397-6 (ISBN)
Conference
IEEE International Symposium on Circuits and Systems (IEEE ISCAS), May 26-29, 2019, Sapporo, Japan
Note

QC 20190927

Available from: 2019-09-27. Created: 2019-09-27. Last updated: 2019-09-27. Bibliographically approved.
Chen, Z., Guo, S., Wang, J., Li, Y. & Lu, Z. (2019). Toward FPGA Security in IoT: A New Detection Technique for Hardware Trojans. IEEE Internet of Things Journal, 6(4), 7061-7068
Toward FPGA Security in IoT: A New Detection Technique for Hardware Trojans
2019 (English). In: IEEE Internet of Things Journal, ISSN 2327-4662, Vol. 6, no. 4, p. 7061-7068. Article in journal (Refereed). Published.
Abstract [en]

Nowadays, field programmable gate arrays (FPGAs) are widely used in the Internet of Things (IoT), since they can provide flexible and scalable solutions to various IoT requirements. Meanwhile, hardware Trojans (HTs), which may lead to undesired chip functions or leak sensitive information, have become a great challenge for FPGA security. Therefore, distinguishing Trojan-infected FPGAs is crucial for reinforcing the security of IoT. To achieve this goal, we propose a clock-tree-concerned technique to detect HTs on FPGAs. First, we present an experimental framework that collects the electromagnetic (EM) radiation emitted by the FPGA clock tree. Then, we propose a Trojan identification approach that extracts a mathematical feature of the obtained EM traces, namely two-dimensional principal component analysis (2DPCA), and automatically isolates Trojan-infected FPGAs from Trojan-free ones using a BP neural network. Finally, we perform extensive experiments to evaluate the effectiveness of our method. The results reveal that our approach is valid for detecting HTs on FPGAs. Specifically, for the Trust-Hub benchmarks, we can find the FPGAs with always-on Trojans (100% detection rate) while identifying triggered Trojans with high probability (up to 92%). In addition, we give a thorough discussion of how the experimental setup, such as probe step size, scanning area, and chip ambient temperature, affects the Trojan detection rate.
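
A minimal sketch of the 2DPCA feature-extraction step, with assumed scan sizes and component count; the resulting features would feed the BP classifier:

```python
import numpy as np

def two_dpca(samples, n_components=4):
    # samples: (n, h, w) stack of 2-D EM scans. Build the image covariance
    # matrix from mean-centered scans, keep the top eigenvectors, and
    # project every scan onto them (classic 2DPCA).
    A = samples - samples.mean(axis=0)
    G = np.einsum('nhw,nhv->wv', A, A) / len(A)   # (w, w) image covariance
    _, eigvecs = np.linalg.eigh(G)                # eigenvalues ascending
    X = eigvecs[:, -n_components:]                # top principal axes
    return samples @ X                            # (n, h, n_components)

rng = np.random.default_rng(0)
traces = rng.normal(size=(20, 32, 32))            # 20 scans over the die area
features = two_dpca(traces).reshape(20, -1)       # flattened input for a BP net
```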

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
Electromagnetic (EM) side channel, field programmable gate array (FPGA), hardware Trojan (HT) detection, Internet of Things (IoT) security
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-257568 (URN)
10.1109/JIOT.2019.2914079 (DOI)
000478957600108 (ISI)
2-s2.0-85070241350 (Scopus ID)
Note

QC 20190923

Available from: 2019-09-23. Created: 2019-09-23. Last updated: 2019-10-15. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-0061-3475
