Publications (10 of 179)
Wang, J., Guo, S., Chen, Z., Li, Y. & Lu, Z. (2018). A New Parallel CODEC Technique for CDMA NoCs. IEEE transactions on industrial electronics (1982. Print), 65(8), 6527-6537
2018 (English). In: IEEE transactions on industrial electronics (1982. Print), ISSN 0278-0046, E-ISSN 1557-9948, Vol. 65, no 8, p. 6527-6537. Article in journal (Refereed), Published
Abstract [en]

Code division multiple access (CDMA) network-on-chip (NoC) has been proposed for many-core systems due to its data transfer parallelism over communication channels. Consequently, the coder-decoder (CODEC) module, which greatly impacts the performance of CDMA NoCs, has attracted growing attention in recent years. In this paper, we propose a new parallel CODEC technique for CDMA NoCs. By using a few simple logic circuits with small penalties in area and power, our new parallel CODEC (NPC) can execute the encoding/decoding process in parallel and thus reduce the data transfer latency. To reveal the benefits of our method for on-chip communication, we apply our NPC to CDMA NoCs and perform extensive experiments. The results show that our method outperforms existing parallel CODECs, such as the Walsh-based parallel CODEC (WPC) and the overloaded parallel CODEC (OPC). Specifically, it improves the critical point of communication latency (7.3% over WPC and 13.5% over OPC), reduces packet latency jitter by about 17.3% (against WPC) and 71.6% (against OPC), and improves energy efficiency by up to 41.2% (against WPC) and 59.2% (against OPC).
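
For readers unfamiliar with the baseline idea behind CDMA on a shared channel, the following is a minimal sketch of textbook Walsh-code spreading and correlation decoding. It is not the NPC technique of the paper, whose circuit details are not given in the abstract; the function names `walsh_codes`, `encode`, and `decode` are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional


def walsh_codes(order: int) -> List[List[int]]:
    """Return 2**order mutually orthogonal +1/-1 codes (Sylvester construction)."""
    h = [[1]]
    for _ in range(order):
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def encode(bits: Dict[int, int], codes: List[List[int]]) -> List[int]:
    """Spread each sender's bit with its own code and add all senders chip by chip."""
    channel = [0] * len(codes[0])
    for sender, bit in bits.items():
        symbol = 1 if bit else -1  # map bit 1 -> +1, bit 0 -> -1
        for i, chip in enumerate(codes[sender]):
            channel[i] += symbol * chip
    return channel


def decode(channel: List[int], code: List[int]) -> Optional[int]:
    """Correlate the summed channel with one code; the sign gives that sender's bit."""
    corr = sum(c * k for c, k in zip(channel, code))
    if corr > 0:
        return 1
    if corr < 0:
        return 0
    return None  # zero correlation: nothing was sent with this code


codes = walsh_codes(2)                      # four codes of length 4
channel = encode({1: 1, 2: 0, 3: 1}, codes)  # three senders share the channel
received = [decode(channel, codes[s]) for s in (1, 2, 3)]
```

Because the codes are orthogonal, each receiver recovers its own sender's bit from the summed signal, which is the source of the transfer parallelism the abstract refers to.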

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2018
Keywords
Code division multiple access (CDMA), coder-decoder (CODEC), energy efficiency, network-on-chip (NoC), performance
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-226177 (URN); 10.1109/TIE.2017.2786230 (DOI); 000428902200050 (); 2-s2.0-85039797002 (Scopus ID)
Note

QC 20180518

Available from: 2018-05-16. Created: 2018-05-16. Last updated: 2018-05-16. Bibliographically approved
Long, Y., Lu, Z. & Shen, H. (2018). Composable Worst-Case Delay Bound Analysis Using Network Calculus. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(3), 705-709
2018 (English). In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 37, no 3, p. 705-709. Article in journal (Refereed), Published
Abstract [en]

Performance analysis plays an indispensable role in the design and evaluation of on-chip networks. In former studies, the end-to-end delay bound is calculated by the equivalent service curve method based on network calculus when resource sharing happens. In this paper, we instead propose a composable method to get the bound. This method first uses the aggregated local arrival curve to get the local delay bound, then calculates the end-to-end bound by summing up the local bounds. This method solves the scalability problem and largely decreases the computational complexity compared with the former method.
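
The summation of local bounds can be illustrated with a small sketch under standard network-calculus assumptions that are not stated in the abstract: each flow is constrained by a token-bucket arrival curve (burst b, rate r), each hop offers a rate-latency service curve (rate R, latency T), and the per-hop bursts are taken as given inputs rather than derived from burst propagation along the path. Under these assumptions the local bound for the aggregate is T + (sum of b) / R. The names `Hop`, `local_delay_bound`, and `end_to_end_bound` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hop:
    rate: float                        # R: guaranteed service rate (flits/cycle)
    latency: float                     # T: service latency (cycles)
    flows: List[Tuple[float, float]]   # (burst b, rate r) of every flow sharing the hop


def local_delay_bound(hop: Hop) -> float:
    """Delay bound of the aggregate at one hop: T + (sum of bursts) / R."""
    burst = sum(b for b, _ in hop.flows)   # aggregated local arrival curve: burst term
    rho = sum(r for _, r in hop.flows)     # aggregated local arrival curve: rate term
    if rho > hop.rate:
        raise ValueError("aggregate arrival rate exceeds service rate; no finite bound")
    return hop.latency + burst / hop.rate


def end_to_end_bound(path: List[Hop]) -> float:
    """Composable bound: sum the local bounds along the path."""
    return sum(local_delay_bound(hop) for hop in path)


path = [
    Hop(rate=1.0, latency=2.0, flows=[(4.0, 0.2), (2.0, 0.3)]),  # local bound 8 cycles
    Hop(rate=0.5, latency=1.0, flows=[(3.0, 0.2)]),              # local bound 7 cycles
]
bound = end_to_end_bound(path)  # 15 cycles
```

Each hop is evaluated independently, so the cost grows linearly with path length, which is one way to read the scalability claim above.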

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2018
Keywords
Composable method, delay bound, local arrival curve (LAC), network calculus (NC)
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-224006 (URN); 10.1109/TCAD.2017.2729283 (DOI); 000425674700015 (); 2-s2.0-85028811873 (Scopus ID)
Note

QC 20180323

Available from: 2018-03-23. Created: 2018-03-23. Last updated: 2018-05-24. Bibliographically approved
Yao, Y. & Lu, Z. (2018). INPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores. In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). Paper presented at 24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Hotel Pyramide Congress Center, Vienna, Austria, 24 February 2018 through 28 February 2018 (pp. 15-26). IEEE Computer Society
2018 (English). In: 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), IEEE Computer Society, 2018, p. 15-26. Conference paper, Published paper (Refereed)
Abstract [en]

As recently studied, serialized competition overhead for entering a critical section is more dominant than critical section execution itself in limiting the performance of multi-threaded shared variable applications on NoC-based many-cores. We illustrate that the invalidation-acknowledgement delay for cache coherency between the home node storing the critical section lock and the cores running competing threads is the leading contributor to high competition overhead in lock spinning, which is realized in various spin-lock primitives (such as the ticket lock, ABQL, MCS lock, etc.) and the spinning phase of queue spin-lock (QSL) in advanced operating systems. To reduce such high lock coherence overhead, we propose in-network packet generation (iNPG) to turn passive 'normal' NoC routers, which only transmit packets, into active 'big' ones that can generate packets. Instead of performing all coherence maintenance at the home node, big routers, which are deployed nearer to competing threads, can generate packets to perform early invalidation-acknowledgement for failing threads before their requests reach the home node, shortening the protocol round-trip delay and thus significantly reducing competition overhead in various locking primitives. We evaluate iNPG in Gem5 using PARSEC and SPEC OMP2012 programs with five different locking primitives. Compared to a state-of-the-art technique accelerating critical section access, experimental results show that iNPG can effectively reduce lock coherence overhead, expediting critical section access by 1.35x on average and 2.03x at maximum and consequently improving the program Region-of-Interest (ROI) runtime by 7.8% on average and 14.7% at maximum.

Place, publisher, year, edition, pages
IEEE Computer Society, 2018
Keywords
Cache Coherency, CMP, Critical Section, In Network Packet Generation, Network on Chip, Synchronisation Primitive
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-228571 (URN); 10.1109/HPCA.2018.00012 (DOI); 2-s2.0-85046805697 (Scopus ID); 9781538636596 (ISBN)
Conference
24th IEEE International Symposium on High Performance Computer Architecture, HPCA 2018, Hotel Pyramide Congress Center, Vienna, Austria, 24 February 2018 through 28 February 2018
Note

QC 20180528

Available from: 2018-05-28. Created: 2018-05-28. Last updated: 2018-05-28. Bibliographically approved
Lu, Z. & Zhao, X. (2018). xMAS-Based QoS Analysis Methodology. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37(2), 364-377
2018 (English). In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 37, no 2, p. 364-377. Article in journal (Refereed), Published
Abstract [en]

On-chip communication system design starting from a high-level model can facilitate formal verification of system properties, such as safety and deadlock freedom. Yet, analyzing its quality-of-service (QoS) property, in our context, per-flow delay bound, is an open challenge. Based on executable micro-architectural specification (xMAS) which is a formal framework modeling communication fabrics, we first present how to model a classic input-queuing virtual channel router using the xMAS primitives and then a QoS analysis methodology using network calculus (NC). Thanks to the precise semantics of the xMAS primitives, the router can be modeled in different variants, which cannot be otherwise captured by normal ad hoc box diagrams. The analysis methodology consists of three steps: 1) given network and flow knowledge, we first create a well-defined precise xMAS model for a specific application on a concrete on-chip network; 2) the specific xMAS model is then mapped to an NC graph (NCG) following a set of mapping rules; and 3) finally, existing QoS analysis techniques can be applied to analyze the NCG to obtain end-to-end delay bound per flow. We also show how to apply the technique to a typical all-to-one communication pattern on a binary-tree network and conduct an SoC case study, exemplifying the step-by-step analysis procedure and discussing the tightness of the results.

Place, publisher, year, edition, pages
IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC, 2018
Keywords
Design methodology, executable micro-architectural specification (xMAS), network calculus (NC), network-on-chip (NoC), quality-of-service (QoS)
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-222173 (URN); 10.1109/TCAD.2017.2706561 (DOI); 000422948500008 (); 2-s2.0-85040972387 (Scopus ID)
Note

QC 20180207

Available from: 2018-02-07. Created: 2018-02-07. Last updated: 2018-03-13. Bibliographically approved
Zhao, X. & Lu, Z. (2017). A Tool for xMAS-Based Modeling and Analysis of Communication Fabrics in Simulink. ACM Transactions on Modeling and Computer Simulation, 27(3), Article ID 16.
2017 (English). In: ACM Transactions on Modeling and Computer Simulation, ISSN 1049-3301, E-ISSN 1558-1195, Vol. 27, no 3, article id 16. Article in journal (Refereed), Published
Abstract [en]

The eXecutable Micro-Architectural Specification (xMAS) language developed in recent years provides an effective way to model on-chip communication fabrics and enables performance-bound analysis with network calculus at the micro-architectural level. For network-on-chip (NoC) performance analysis, model validation is essential to ensure correctness and accuracy. In order to facilitate xMAS modeling and the corresponding analysis validation, this work presents a unified platform based on xMAS in Simulink. The platform provides a friendly graphical user interface for xMAS modeling and parameter setup by taking advantage of the Simulink modeling environment. The regulator and latency-rate server are added to the xMAS primitive set to support typical flow and service behaviors. Hierarchical model build-up and Verilog-HDL code generation are supported to manage complex models and to conduct cycle-accurate, bit-accurate simulations. Based on the generated simulation models of xMAS, this tool is applied to evaluate the tightness of analytical delay bound results. We demonstrate the application as well as the work flow of the xMAS tool through a two-agent communication example and an all-to-one communication example with a tree topology.

Place, publisher, year, edition, pages
ASSOC COMPUTING MACHINERY, 2017
Keywords
Performance analysis, simulink, network calculus, network on chip
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-215460 (URN); 10.1145/3005446 (DOI); 000411266600001 (); 2-s2.0-85028652385 (Scopus ID)
Note

QC 20171018

Available from: 2017-10-18. Created: 2017-10-18. Last updated: 2018-01-13. Bibliographically approved
Wang, J., Chen, Z., Guo, J., Li, Y. & Lu, Z. (2017). ACO-Based Thermal-Aware Thread-to-Core Mapping for Dark-Silicon-Constrained CMPs. IEEE Transactions on Electron Devices, 64(3), 930-937
2017 (English). In: IEEE Transactions on Electron Devices, ISSN 0018-9383, E-ISSN 1557-9646, Vol. 64, no 3, p. 930-937. Article in journal (Refereed), Published
Abstract [en]

The limitation on thermal budget in chip multiprocessor (CMP) results in a fraction of inactive silicon regions called dark silicon, which significantly impacts the system performance. In this paper, we propose a thread-to-core mapping method for dark-silicon-constrained CMPs to address their thermal issue. We first propose a thermal prediction model to forecast CMP temperature after the CMP executes a forthcoming application. Then, we develop an ant colony optimization-based algorithm to conduct the thread-to-core mapping process, such that the CMP peak temperature is minimized and, consequently, the probability of triggering CMP dynamic thermal management is decreased. Finally, we evaluate our method and compare it with the baseline (a standard Linux scheduler) and other existing methods (NoC-Sprinting, DaSiM mapping, and TP mapping). The simulation results show that our method achieves a good thermal profile and computational performance, and performs well with chip scaling. Specifically, it eliminates all thermal emergency time, outperforming all other methods, and improves million instructions per second by up to 12.9% against the baseline.

Place, publisher, year, edition, pages
IEEE Press, 2017
Keywords
Chip multiprocessor (CMP), dark silicon, thermal model, thread-to-core mapping
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-204059 (URN); 10.1109/TED.2017.2653838 (DOI); 000396056700030 (); 2-s2.0-85011294756 (Scopus ID)
Note

QC 20170330

Available from: 2017-03-30. Created: 2017-03-30. Last updated: 2018-01-13. Bibliographically approved
Lu, Z. & Yao, Y. (2017). Dynamic Traffic Regulation in NoC-Based Systems. IEEE Transactions on Very Large Scale Integration (vlsi) Systems, 25(2), 556-569
2017 (English). In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 25, no 2, p. 556-569. Article in journal (Refereed), Published
Abstract [en]

In network-on-chip (NoC)-based systems, performance enhancement has primarily focused on the network itself, with little attention paid to controlling traffic injection at the network boundary. This is unsatisfactory because traffic may be over-injected, aggravating congestion and lowering performance. Recently, traffic regulation has been proposed as an orthogonal means for performance improvement. Rather than admitting packets as soon as possible, traffic regulation may hold back packet injection by admitting packets into the network only when the accumulated traffic volume in any time interval does not exceed a threshold. These regulation techniques are, however, often static, likely causing overregulation and underregulation. We propose dynamic traffic regulation to improve the system performance for NoC-based multi/many-processor systems-on-chip (MPSoC) and chip multi/many-core processor (CMP) designs. It can be applied to MPSoCs for intellectual property integration in an open-loop fashion by injecting traffic according to its run-time profiled characteristics. It can also be applied to CMPs in a closed-loop fashion by admitting traffic fully adaptive to the traffic and network states. Through extensive experiments and results, we show that both the open-loop and closed-loop dynamic regulation techniques can significantly improve the network and system performance.
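
The admission rule described above (accumulated traffic in any interval must stay under a threshold) is commonly realized with a token-bucket regulator. The following is a minimal sketch of such a static regulator, assuming a burst allowance `sigma` and a sustained rate `rho`; it is not the dynamic scheme of the paper, and the class name `TokenBucketRegulator` and its method are hypothetical.

```python
class TokenBucketRegulator:
    """Static (sigma, rho) regulator: traffic admitted in any interval of
    length t never exceeds sigma + rho * t flits."""

    def __init__(self, sigma: float, rho: float) -> None:
        self.sigma = sigma    # bucket depth: maximum burst, in flits
        self.rho = rho        # token refill rate, in flits per cycle
        self.tokens = sigma   # the bucket starts full
        self.last = 0         # cycle of the previous update

    def try_inject(self, now: int, flits: int) -> bool:
        """Return True and consume tokens if the packet may enter the network now."""
        # Refill for the elapsed cycles, capped at the bucket depth.
        self.tokens = min(self.sigma, self.tokens + self.rho * (now - self.last))
        self.last = now
        if flits <= self.tokens:
            self.tokens -= flits
            return True
        return False  # hold the packet back at the network boundary


reg = TokenBucketRegulator(sigma=4, rho=0.5)
decisions = [reg.try_inject(0, 4), reg.try_inject(1, 1), reg.try_inject(2, 1)]
# The initial burst is admitted, the next packet is held back, and it is
# admitted one cycle later once enough tokens have accumulated.
```

A dynamic variant would adjust `sigma` and `rho` at run time; how that adjustment is made is specific to the paper and is not sketched here.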

Place, publisher, year, edition, pages
IEEE Press, 2017
Keywords
Chip multi/many-core processor (CMP), fuzzy control, multi/many-processor systems-on-chip (MPSoC), network-on-chip (NoC), traffic engineering
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-204100 (URN); 10.1109/TVLSI.2016.2584781 (DOI); 000394593300015 (); 2-s2.0-84979735738 (Scopus ID)
Note

QC 20170329

Available from: 2017-03-29. Created: 2017-03-29. Last updated: 2017-11-29. Bibliographically approved
Xiong, Q., Wu, F., Lu, Z. & Xie, C. (2017). Extending Real-Time Analysis for Wormhole NoCs. I.E.E.E. transactions on computers (Print), 66(9), 1532-1546, Article ID 7884964.
2017 (English). In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 66, no 9, p. 1532-1546, article id 7884964. Article in journal (Refereed), Published
Abstract [en]

The delay upper-bound analysis problem is of fundamental importance to real-time applications in Networks-on-Chip (NoCs). In the paper, we revisit two state-of-the-art analysis models for real-time communication in wormhole NoCs with priority-based preemptive arbitration and show that the models only support specific router architectures with large buffer sizes. We then propose an extended analysis model to estimate delay upper-bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interferences according to the relative positions of traffic flows and taking the buffer influence into consideration. Simulated evaluations show that our model supports one more router architecture and applies to small buffer sizes compared to the previous models.

Place, publisher, year, edition, pages
IEEE Computer Society, 2017
Keywords
delay, real-time communication, Wormhole NoC, Computer architecture, Network architecture, Routers, Extended analysis, Real time analysis, Real-time application, Relative positions, Router architecture, Upper bound analysis, Network-on-chip
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-216210 (URN); 10.1109/TC.2017.2686391 (DOI); 000407449400006 (); 2-s2.0-85029510717 (Scopus ID)
Note

QC 20171218

Available from: 2017-12-18. Created: 2017-12-18. Last updated: 2018-01-13. Bibliographically approved
Lu, Z. & Yao, Y. (2017). Marginal Performance: Formalizing and Quantifying Power Over/Under Provisioning in NoC DVFS. I.E.E.E. transactions on computers (Print), 66(11), 1903-1917
2017 (English). In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 66, no 11, p. 1903-1917. Article in journal (Refereed), Published
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-216598 (URN); 10.1109/TC.2017.2715018 (DOI); 000412566600006 (); 2-s2.0-85021826419 (Scopus ID)
Note

QC 20171116

Available from: 2017-11-16. Created: 2017-11-16. Last updated: 2018-03-07. Bibliographically approved
Chen, X., Lu, Z., Liu, S. & Chen, S. (2017). Round-trip DRAM Access Fairness in 3D NoC-based Many-core Systems. ACM Transactions on Embedded Computing Systems, 16, Article ID 162.
2017 (English). In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 16, article id 162. Article in journal (Refereed), Published
Abstract [en]

In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances, and the latency gap between different DRAM accesses becomes bigger as the network size increases, which leads to unfair DRAM access performance among different nodes. This phenomenon may lead to high latencies for some DRAM accesses that become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency difference are discussed in detail. Secondly, the DRAM access fairness is further quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip DRAM access time as the basis to prioritize the DRAM accesses in DRAM interfaces so that the DRAM accesses with potential high latencies can be transferred as early and fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach can achieve fair DRAM access and outperform the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed by references [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD), and speedup. In the experiments, the maximum improvements in the maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3% respectively. Besides, our proposal brings very small extra hardware overhead (< 0.6%) in comparison to the three counterparts.
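
The prioritization step can be pictured as a priority queue keyed on the predicted round-trip time. The sketch below assumes the prediction is already available as a number of cycles and serves the request with the largest predicted latency first, falling back to arrival order on ties; the class `LatencyAwareQueue` and its methods are hypothetical and do not reproduce the paper's predictor or hardware design.

```python
import heapq
import itertools
from typing import List, Tuple


class LatencyAwareQueue:
    """Serve the DRAM request with the highest predicted round-trip latency first."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._seq = itertools.count()  # arrival order, used as an FCFS tie-break

    def push(self, request_id: str, predicted_round_trip: float) -> None:
        # heapq is a min-heap, so the latency is negated to pop the largest first.
        heapq.heappush(self._heap, (-predicted_round_trip, next(self._seq), request_id))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = LatencyAwareQueue()
queue.push("near-node", 30.0)
queue.push("far-node", 80.0)
queue.push("mid-node", 55.0)
order = [queue.pop() for _ in range(len(queue))]
# Requests from distant nodes are served ahead of closer ones.
```

Under plain FCFS the same three requests would be served in arrival order, so the far node would wait behind the near one.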

Place, publisher, year, edition, pages
ASSOC COMPUTING MACHINERY, 2017
Keywords
3D Networks-on-Chip (NoC), DRAM access fairness, DRAM scheduling, round-trip
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-217943 (URN); 10.1145/3126561 (DOI); 000414353800045 (); 2-s2.0-85030680902 (Scopus ID)
Note

QC 20171121

Available from: 2017-11-21. Created: 2017-11-21. Last updated: 2017-11-21. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0061-3475