Publications (10 of 181)
Wang, J., Guo, S., Chen, Z., Li, Y. & Lu, Z. (2018). A New Parallel CODEC Technique for CDMA NoCs. IEEE transactions on industrial electronics (1982. Print), 65(8), 6527-6537
2018 (English) In: IEEE transactions on industrial electronics (1982. Print), ISSN 0278-0046, E-ISSN 1557-9948, Vol. 65, no 8, p. 6527-6537. Article in journal (Refereed) Published
Abstract [en]

Code division multiple access (CDMA) network-on-chip (NoC) has been proposed for many-core systems due to its data transfer parallelism over communication channels. Consequently, coder-decoder (CODEC) module, which greatly impacts the performance of CDMA NoCs, attracted growing attention in recent years. In this paper, we propose a new parallel CODEC technique for CDMA NoCs. In general, by using a few simple logic circuits with small penalties in area and power, our new parallel (NPC) CODEC can execute the encoding/decoding process in parallel and thus reduce the data transfer latency. To reveal the benefits of our method for on-chip communication, we apply our NPC to CDMA NoCs and perform extensive experiments. From the results, we can find that our method outperforms existing parallel CODECs, such as Walsh-based parallel CODEC (WPC) and overloaded parallel CODEC (OPC). Specifically, it improves the critical point of communication latency (7.3% over WPC and 13.5% over OPC), reduces packet latency jitter by about 17.3% (against WPC) and 71.6% (against OPC), and improves energy efficiency by up to 41.2% (against WPC) and 59.2% (against OPC).
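
For readers unfamiliar with the underlying idea, the following is a minimal sketch of conventional Walsh-code CDMA encoding and decoding, the general principle that parallel CODECs such as WPC build on. It is not the NPC design from the paper; the function names (`walsh_codes`, `encode`, `decode`) and the choice to skip the all-ones code are assumptions made for illustration only.

```python
def walsh_codes(n):
    """Return the n x n Walsh-Hadamard matrix (n a power of two) as rows of +1/-1."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Sylvester construction: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def encode(bits, codes):
    """Spread each sender's bit with its own code and sum chip-wise on the shared channel."""
    symbols = [1 if b else -1 for b in bits]  # map bit 1 -> +1, bit 0 -> -1
    length = len(codes[0])
    return [sum(s * code[i] for s, code in zip(symbols, codes)) for i in range(length)]


def decode(channel, code):
    """Correlate the channel with one code; the sign of the correlation gives the bit."""
    corr = sum(c * k for c, k in zip(channel, code))
    return 1 if corr > 0 else 0


# Three senders share one channel (hypothetical example values).
codes = walsh_codes(4)[1:]  # assumption: skip the all-ones row
bits = [1, 0, 1]
channel = encode(bits, codes)
recovered = [decode(channel, code) for code in codes]
```

Because the codes are mutually orthogonal, each receiver recovers its own sender's bit from the summed channel, which is what allows several transfers to share a link at once.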

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2018
Keywords
Code division multiple access (CDMA), coder-decoder (CODEC), energy efficiency, network-on-chip (NoC), performance
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-226177 (URN); 10.1109/TIE.2017.2786230 (DOI); 000428902200050 (); 2-s2.0-85039797002 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-16 Created: 2018-05-16 Last updated: 2018-10-19 Bibliographically approved
Xiong, Q., Wu, F., Lu, Z., Zhu, Y., Zhou, Y., Chu, Y., . . . Huang, P. (2018). Characterizing 3D Floating Gate NAND Flash: Observations, Analyses, and Implications. ACM Transactions on Storage, 14(2), Article ID 16.
2018 (English) In: ACM Transactions on Storage, ISSN 1553-3077, E-ISSN 1553-3093, Vol. 14, no 2, article id 16. Article in journal (Refereed) Published
Abstract [en]

As both NAND flash memory manufacturers and users are turning their attentions from planar architecture towards three-dimensional (3D) architecture, it becomes critical and urgent to understand the characteristics of 3D NAND flash memory. These characteristics, especially those different from planar NAND flash, can significantly affect design choices of flash management techniques. In this article, we present a characterization study on the state-of-the-art 3D floating gate (FG) NAND flash memory through comprehensive experiments on an FPGA-based 3D NAND flash evaluation platform. We make distinct observations on its performance and reliability, such as operation latencies and various error patterns, followed by careful analyses from physical and circuit-level perspectives. Although 3D FG NAND flash provides much higher storage densities than planar NAND flash, it faces new performance challenges of garbage collection overhead and program performance variations and more complicated reliability issues due to, e.g., distinct location dependence and value dependence of errors. We also summarize the differences between 3D FG NAND flash and planar NAND flash and discuss implications on the designs of NAND flash management techniques brought by the architecture innovation. We believe that our work will facilitate developing novel 3D FG NAND flash-oriented designs to achieve better performance and reliability.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
Keywords
3D floating gate NAND flash, MLC, error pattern
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-232795 (URN); 10.1145/3162616 (DOI); 000434635800005 ()
Note

QC 20180802

Available from: 2018-08-02 Created: 2018-08-02 Last updated: 2018-08-02 Bibliographically approved
Long, Y., Lu, Z. & Shen, H. (2018). Composable Worst-Case Delay Bound Analysis Using Network Calculus. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(3), 705-709
2018 (English) In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 37, no 3, p. 705-709. Article in journal (Refereed) Published
Abstract [en]

Performance analysis is playing an indispensable role in design and evaluation for on-chip networks. In former studies, the end-to-end delay bound is calculated by the equivalent service curve method based on network calculus when resource sharing happens. However, in this paper, we propose a composable method to get the bound. This method uses the aggregated local arrival curve to get the local delay bound first, then calculates the end-to-end bound by summing up local bounds. This method solves the scalability problem and largely decreases the computation complexity compared with the former method.
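
As a rough illustration of the "local bounds, then sum" idea, the sketch below uses the textbook network-calculus result for a token-bucket arrival curve (burst sigma, rate rho) served by a rate-latency server (rate R, latency T): the local delay bound is T + sigma/R when rho <= R, and the output burst grows to sigma + rho*T. This is a simplified single-flow model under those stated assumptions, not the paper's aggregated local arrival curve method; the function names and example numbers are hypothetical.

```python
def local_delay_bound(burst, rate, service_rate, latency):
    """Delay bound at one node: T + sigma / R, valid only when rho <= R."""
    if rate > service_rate:
        raise ValueError("unstable: arrival rate exceeds service rate")
    return latency + burst / service_rate


def end_to_end_bound(burst, rate, hops):
    """Sum per-hop bounds along a path; hops is a list of (service_rate, latency)."""
    total = 0.0
    for service_rate, latency in hops:
        total += local_delay_bound(burst, rate, service_rate, latency)
        # The flow leaving a rate-latency server has burst sigma + rho * T.
        burst += rate * latency
    return total


# Two identical hops, burst of 4 flits, rate of 0.25 flits/cycle (hypothetical values).
bound = end_to_end_bound(4.0, 0.25, [(1.0, 2.0), (1.0, 2.0)])
```

Each hop is analysed with only local information, which is why a composable analysis of this kind scales with path length rather than requiring one end-to-end equivalent service curve.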

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2018
Keywords
Composable method, delay bound, local arrival curve (LAC), network calculus (NC)
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-224006 (URN); 10.1109/TCAD.2017.2729283 (DOI); 000425674700015 (); 2-s2.0-85028811873 (Scopus ID)
Note

QC 20180323

Available from: 2018-03-23 Created: 2018-03-23 Last updated: 2018-05-24 Bibliographically approved
Yao, Y. & Lu, Z. (2018). INPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). Paper presented at 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Hotel Pyramide Congress Center, Vienna, Austria, 24 February 2018 through 28 February 2018 (pp. 15-26). IEEE Computer Society
2018 (English) In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE Computer Society, 2018, p. 15-26. Conference paper, Published paper (Refereed)
Abstract [en]

As recently studied, serialized competition overhead for entering critical section is more dominant than critical section execution itself in limiting performance of multi-threaded shared variable applications on NoC-based many-cores. We illustrate that the invalidation-acknowledgement delay for cache coherency between the home node storing the critical section lock and the cores running competing threads is the leading factor to high competition overhead in lock spinning, which is realized in various spin-lock primitives (such as the ticket lock, ABQL, MCS lock, etc.) and the spinning phase of queue spin-lock (QSL) in advanced operating systems. To reduce such high lock coherence overhead, we propose in-network packet generation (iNPG) to turn passive 'normal' NoC routers which only transmit packets into active 'big' ones that can generate packets. Instead of performing all coherence maintenance at the home node, big routers which are deployed nearer to competing threads can generate packets to perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead in various locking primitives. We evaluate iNPG in Gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique accelerating critical section access, experimental results show that iNPG can effectively reduce lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.

Place, publisher, year, edition, pages
IEEE Computer Society, 2018
Series
International Symposium on High-Performance Computer Architecture-Proceedings, ISSN 1530-0897
Keywords
Cache Coherency, CMP, Critical Section, In Network Packet Generation, Network on Chip, Synchronisation Primitive
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-228571 (URN); 10.1109/HPCA.2018.00012 (DOI); 000440297700002 (); 2-s2.0-85046805697 (Scopus ID); 9781538636596 (ISBN)
Conference
24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Hotel Pyramide Congress Center, Vienna, Austria, 24 February 2018 through 28 February 2018
Note

QC 20180528

Available from: 2018-05-28 Created: 2018-05-28 Last updated: 2018-08-16 Bibliographically approved
Shi, X., Wu, F., Wang, S., Xie, C. & Lu, Z. (2018). Program Error Rate-based Wear Leveling for NAND Flash Memory. In: PROCEEDINGS OF THE 2018 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE). Paper presented at Design, Automation and Test in Europe Conference and Exhibition (DATE), MAR 19-23, 2018, Dresden, GERMANY (pp. 1241-1246). IEEE
2018 (English) In: PROCEEDINGS OF THE 2018 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE), IEEE, 2018, p. 1241-1246. Conference paper, Published paper (Refereed)
Abstract [en]

Wear leveling has become a fundamental issue in the design of Solid State Disks (SSDs) based on NAND Flash memory. Existing schemes aim to equalize the number of programming/erase (P/E) cycles and memory raw bit error rates (BER) among all the flash blocks. However, due to fabrication process variation, different blocks of the same flash chip usually have largely different endurance in terms of BER and program error rate (PER). Such conventional designs cannot obtain the wear status of flash blocks precisely. This paper proposes PER-WL, an efficient PER-based wear leveling scheme that uses PER statistics as the measurement of flash block wear-out pace, and performs block data swapping to improve the wear leveling efficiency. In our evaluation with four realistic workloads, the PER-based wear leveling scheme can achieve 17% and 9% variance of program error rate reduction, 8% and 3% program error rate reduction with 5% and 2% system performance degradation when compared to two state-of-the-art wear leveling schemes on average.

Place, publisher, year, edition, pages
IEEE, 2018
Series
Design Automation and Test in Europe Conference and Exhibition, ISSN 1530-1591
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-231649 (URN); 000435148800236 (); 978-3-9819-2630-9 (ISBN)
Conference
Design, Automation and Test in Europe Conference and Exhibition (DATE), MAR 19-23, 2018, Dresden, GERMANY
Note

QC 20180904

Available from: 2018-09-04 Created: 2018-09-04 Last updated: 2018-10-19 Bibliographically approved
Lu, Z. & Zhao, X. (2018). xMAS-Based QoS Analysis Methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(2), 364-377
2018 (English) In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 37, no 2, p. 364-377. Article in journal (Refereed) Published
Abstract [en]

On-chip communication system design starting from a high-level model can facilitate formal verification of system properties, such as safety and deadlock freedom. Yet, analyzing its quality-of-service (QoS) property, in our context, per-flow delay bound, is an open challenge. Based on executable micro-architectural specification (xMAS) which is a formal framework modeling communication fabrics, we first present how to model a classic input-queuing virtual channel router using the xMAS primitives and then a QoS analysis methodology using network calculus (NC). Thanks to the precise semantics of the xMAS primitives, the router can be modeled in different variants, which cannot be otherwise captured by normal ad hoc box diagrams. The analysis methodology consists of three steps: 1) given network and flow knowledge, we first create a well-defined precise xMAS model for a specific application on a concrete on-chip network; 2) the specific xMAS model is then mapped to an NC graph (NCG) following a set of mapping rules; and 3) finally, existing QoS analysis techniques can be applied to analyze the NCG to obtain end-to-end delay bound per flow. We also show how to apply the technique to a typical all-to-one communication pattern on a binary-tree network and conduct an SoC case study, exemplifying the step-by-step analysis procedure and discussing the tightness of the results.

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2018
Keywords
Design methodology, executable micro-architectural specification (xMAS), network calculus (NC), network-on-chip (NoC), quality-of-service (QoS)
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-222173 (URN); 10.1109/TCAD.2017.2706561 (DOI); 000422948500008 (); 2-s2.0-85040972387 (Scopus ID)
Note

QC 20180207

Available from: 2018-02-07 Created: 2018-02-07 Last updated: 2018-03-13 Bibliographically approved
Zhao, X. & Lu, Z. (2017). A Tool for xMAS-Based Modeling and Analysis of Communication Fabrics in Simulink. ACM Transactions on Modeling and Computer Simulation, 27(3), Article ID 16.
2017 (English) In: ACM Transactions on Modeling and Computer Simulation, ISSN 1049-3301, E-ISSN 1558-1195, Vol. 27, no 3, article id 16. Article in journal (Refereed) Published
Abstract [en]

The eXecutable Micro-Architectural Specification (xMAS) language developed in recent years finds an effective way to model on-chip communication fabrics and enables performance-bound analysis with network calculus at the micro-architectural level. For network-on-chip (NoC) performance analysis, model validation is essential to ensure correctness and accuracy. In order to facilitate the xMAS modeling and corresponding analysis validation, this work presents a unified platform based on xMAS in Simulink. The platform provides a friendly graphical user interface for xMAS modeling and parameter setup by taking advantage of the Simulink modeling environment. The regulator and latency-rate server are added to the xMAS primitive set to support typical flow and service behaviors. Hierarchical model build-up and Verilog-HDL code generation are essentially supported to manage complex models and to conduct cycle-accurate bit-accurate simulations. Based on the generated simulation models of xMAS, this tool is applied to evaluate the tightness of analytical delay bound results. We demonstrate the application as well as the work flow of the xMAS tool through a two-agent communication example and an all-to-one communication example with a tree topology.

Place, publisher, year, edition, pages
ASSOC COMPUTING MACHINERY, 2017
Keywords
Performance analysis, simulink, network calculus, network on chip
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-215460 (URN); 10.1145/3005446 (DOI); 000411266600001 (); 2-s2.0-85028652385 (Scopus ID)
Note

QC 20171018

Available from: 2017-10-18 Created: 2017-10-18 Last updated: 2018-01-13 Bibliographically approved
Wang, J., Chen, Z., Guo, J., Li, Y. & Lu, Z. (2017). ACO-Based Thermal-Aware Thread-to-Core Mapping for Dark-Silicon-Constrained CMPs. IEEE Transactions on Electron Devices, 64(3), 930-937
2017 (English) In: IEEE Transactions on Electron Devices, ISSN 0018-9383, E-ISSN 1557-9646, Vol. 64, no 3, p. 930-937. Article in journal (Refereed) Published
Abstract [en]

The limitation on thermal budget in chip multiprocessor (CMP) results in a fraction of inactive silicon regions called dark silicon, which significantly impacts the system performance. In this paper, we propose a thread-to-core mapping method for dark-silicon-constrained CMPs to address their thermal issue. We first propose a thermal prediction model to forecast CMP temperature after the CMP executes a forthcoming application. Then, we develop an ant colony optimization-based algorithm to conduct the thread-to-core mapping process, such that the CMP peak temperature is minimized and, consequently, the probability of triggering CMP dynamic thermal management is decreased. Finally, we evaluate our method and compare it with the baseline (a standard Linux scheduler) and other existing methods (NoC-Sprinting, DaSiM mapping, and TP mapping). The simulation results show that our method gains good thermal profile and computational performance, and performs well with chip scaling. Specifically, it eliminates all thermal emergency time, outperforming all other methods, and gains million instructions per second improvement up to 12.9% against the baseline.

Place, publisher, year, edition, pages
IEEE Press, 2017
Keywords
Chip multiprocessor (CMP), dark silicon, thermal model, thread-to-core mapping
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-204059 (URN); 10.1109/TED.2017.2653838 (DOI); 000396056700030 (); 2-s2.0-85011294756 (Scopus ID)
Note

QC 20170330

Available from: 2017-03-30 Created: 2017-03-30 Last updated: 2018-01-13 Bibliographically approved
Lu, Z. & Yao, Y. (2017). Dynamic Traffic Regulation in NoC-Based Systems. IEEE Transactions on Very Large Scale Integration (vlsi) Systems, 25(2), 556-569
2017 (English) In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 25, no 2, p. 556-569. Article in journal (Refereed) Published
Abstract [en]

In network-on-chip (NoC)-based systems, performance enhancement has primarily focused on the network itself, with little attention paid to controlling traffic injection at the network boundary. This is unsatisfactory because traffic may be over injected, aggravating congestion, and lowering performance. Recently, traffic regulation is proposed as an orthogonal means for performance improvement. Rather than as soon as possible admission, traffic regulation may hold back packet injection by admitting packets into the network only when the accumulated traffic volume at any time interval does not exceed a threshold. These regulation techniques are, however, often static, likely causing overregulation and underregulation. We propose dynamic traffic regulation to improve the system performance for NoC-based multi/many-processor systems-on-chip (MPSoC) and chip multi/many-core processor (CMP) designs. It can be applied to MPSoCs for intellectual property integration in an open-loop fashion by injecting traffic according to its run-time profiled characteristics. It can also be applied to CMPs in a closed-loop fashion by admitting traffic fully adaptive to the traffic and network states. Through extensive experiments and results, we show that both the open-loop and closed-loop dynamic regulation techniques can significantly improve the network and system performance.
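
The static regulation that the abstract describes as the starting point (admit a packet only if the accumulated volume over any interval stays within a threshold) is commonly realized as a token-bucket, or (sigma, rho), regulator. The sketch below shows that static form only; the class name, parameters, and time units are assumptions for illustration, and the paper's dynamic schemes, which adapt the regulation at run time, are not reproduced here.

```python
class TokenBucketRegulator:
    """Admit traffic so that volume over any interval t is at most burst + rate * t."""

    def __init__(self, burst, rate):
        self.burst = burst    # sigma: maximum tokens the bucket can hold
        self.rate = rate      # rho: tokens added per time unit
        self.tokens = burst   # start with a full bucket
        self.last = 0         # time of the previous admission attempt

    def admit(self, now, size=1):
        """Return True and consume tokens if a packet of `size` may be injected at `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill tokens for the elapsed time, capped at the bucket size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False  # the caller holds the packet back and retries later


# Hypothetical injection attempts at cycles 0, 0, 0, 2, 2.
reg = TokenBucketRegulator(burst=2, rate=0.5)
decisions = [reg.admit(t) for t in [0, 0, 0, 2, 2]]
```

A dynamic variant in the spirit of the abstract would adjust `burst` and `rate` while the system runs, either from profiled traffic characteristics (open loop) or from observed network state (closed loop).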

Place, publisher, year, edition, pages
IEEE Press, 2017
Keywords
Chip multi/many-core processor (CMP), fuzzy control, multi/many-processor systems-on-chip (MPSoC), network-on-chip (NoC), traffic engineering
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-204100 (URN); 10.1109/TVLSI.2016.2584781 (DOI); 000394593300015 (); 2-s2.0-84979735738 (Scopus ID)
Note

QC 20170329

Available from: 2017-03-29 Created: 2017-03-29 Last updated: 2017-11-29 Bibliographically approved
Xiong, Q., Wu, F., Lu, Z. & Xie, C. (2017). Extending Real-Time Analysis for Wormhole NoCs. I.E.E.E. transactions on computers (Print), 66(9), 1532-1546, Article ID 7884964.
2017 (English) In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 66, no 9, p. 1532-1546, article id 7884964. Article in journal (Refereed) Published
Abstract [en]

The delay upper-bound analysis problem is of fundamental importance to real-time applications in Network-on-Chips (NoCs). In the paper, we revisit two state-of-the-art analysis models for real-time communication in wormhole NoCs with priority-based preemptive arbitration and show that the models only support specific router architectures with large buffer sizes. We then propose an extended analysis model to estimate delay upper-bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interferences according to the relative positions of traffic flows and taking the buffer influence into consideration. Simulated evaluations show that our model supports one more router architecture and applies to small buffer sizes compared to the previous models.

Place, publisher, year, edition, pages
IEEE Computer Society, 2017
Keywords
delay, real-time communication, Wormhole NoC, Computer architecture, Network architecture, Routers, Extended analysis, Real time analysis, Real-time application, Relative positions, Router architecture, Upper bound analysis, Network-on-chip
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-216210 (URN); 10.1109/TC.2017.2686391 (DOI); 000407449400006 (); 2-s2.0-85029510717 (Scopus ID)
Note

QC 20171218

Available from: 2017-12-18 Created: 2017-12-18 Last updated: 2018-01-13 Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0061-3475