Publications (10 of 175)
Zhao, X. & Lu, Z. (2017). A Tool for xMAS-Based Modeling and Analysis of Communication Fabrics in Simulink. ACM Transactions on Modeling and Computer Simulation, 27(3), Article ID 16.
2017 (English). In: ACM Transactions on Modeling and Computer Simulation, ISSN 1049-3301, E-ISSN 1558-1195, Vol. 27, no 3, 16. Article in journal (Refereed). Published
Abstract [en]

The eXecutable Micro-Architectural Specification (xMAS) language, developed in recent years, offers an effective way to model on-chip communication fabrics and enables performance-bound analysis with network calculus at the micro-architectural level. For network-on-chip (NoC) performance analysis, model validation is essential to ensure correctness and accuracy. To facilitate xMAS modeling and the corresponding analysis validation, this work presents a unified platform based on xMAS in Simulink. The platform provides a friendly graphical user interface for xMAS modeling and parameter setup by taking advantage of the Simulink modeling environment. The regulator and latency-rate server are added to the xMAS primitive set to support typical flow and service behaviors. Hierarchical model build-up and Verilog-HDL code generation are supported to manage complex models and to conduct cycle-accurate, bit-accurate simulations. Based on the generated simulation models of xMAS, this tool is applied to evaluate the tightness of analytical delay bound results. We demonstrate the application as well as the workflow of the xMAS tool through a two-agent communication example and an all-to-one communication example with a tree topology.

Place, publisher, year, edition, pages
Association for Computing Machinery, 2017
Keyword
Performance analysis, Simulink, network calculus, network on chip
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-215460 (URN) 10.1145/3005446 (DOI) 000411266600001 () 2-s2.0-85028652385 (Scopus ID)
Note

QC 20171018

Available from: 2017-10-18 Created: 2017-10-18 Last updated: 2018-01-13. Bibliographically approved
Wang, J., Chen, Z., Guo, J., Li, Y. & Lu, Z. (2017). ACO-Based Thermal-Aware Thread-to-Core Mapping for Dark-Silicon-Constrained CMPs. IEEE Transactions on Electron Devices, 64(3), 930-937.
2017 (English). In: IEEE Transactions on Electron Devices, ISSN 0018-9383, E-ISSN 1557-9646, Vol. 64, no 3, 930-937 p. Article in journal (Refereed). Published
Abstract [en]

The limited thermal budget of a chip multiprocessor (CMP) results in a fraction of inactive silicon regions called dark silicon, which significantly impacts system performance. In this paper, we propose a thread-to-core mapping method for dark-silicon-constrained CMPs to address their thermal issue. We first propose a thermal prediction model to forecast the CMP temperature after the CMP executes a forthcoming application. Then, we develop an ant colony optimization-based algorithm to conduct the thread-to-core mapping process, such that the CMP peak temperature is minimized and, consequently, the probability of triggering CMP dynamic thermal management is decreased. Finally, we evaluate our method and compare it with the baseline (a standard Linux scheduler) and other existing methods (NoC-Sprinting, DaSiM mapping, and TP mapping). The simulation results show that our method achieves a good thermal profile and computational performance, and performs well with chip scaling. Specifically, it eliminates all thermal emergency time, outperforming all other methods, and improves million instructions per second (MIPS) by up to 12.9% against the baseline.

Place, publisher, year, edition, pages
IEEE Press, 2017
Keyword
Chip multiprocessor (CMP), dark silicon, thermal model, thread-to-core mapping
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-204059 (URN) 10.1109/TED.2017.2653838 (DOI) 000396056700030 () 2-s2.0-85011294756 (Scopus ID)
Note

QC 20170330

Available from: 2017-03-30 Created: 2017-03-30 Last updated: 2018-01-13. Bibliographically approved
Lu, Z. & Yao, Y. (2017). Dynamic Traffic Regulation in NoC-Based Systems. IEEE Transactions on Very Large Scale Integration (vlsi) Systems, 25(2), 556-569.
2017 (English). In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 25, no 2, 556-569 p. Article in journal (Refereed). Published
Abstract [en]

In network-on-chip (NoC)-based systems, performance enhancement has primarily focused on the network itself, with little attention paid to controlling traffic injection at the network boundary. This is unsatisfactory because traffic may be over-injected, aggravating congestion and lowering performance. Recently, traffic regulation has been proposed as an orthogonal means for performance improvement. Rather than as-soon-as-possible admission, traffic regulation may hold back packet injection by admitting packets into the network only when the accumulated traffic volume over any time interval does not exceed a threshold. These regulation techniques are, however, often static, likely causing overregulation and underregulation. We propose dynamic traffic regulation to improve the system performance for NoC-based multi/many-processor systems-on-chip (MPSoC) and chip multi/many-core processor (CMP) designs. It can be applied to MPSoCs for intellectual property integration in an open-loop fashion by injecting traffic according to its run-time profiled characteristics. It can also be applied to CMPs in a closed-loop fashion by admitting traffic fully adaptive to the traffic and network states. Through extensive experiments and results, we show that both the open-loop and closed-loop dynamic regulation techniques can significantly improve the network and system performance.
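
The admission rule described in this abstract (inject a packet only while the accumulated traffic volume stays under a threshold) can be illustrated with a static token-bucket regulator. This is a minimal sketch of the general idea, not the authors' dynamic scheme: the class name `TokenBucketRegulator`, the parameters `burst` and `rate`, and the one-token-per-packet accounting are all assumptions made for illustration.

```python
class TokenBucketRegulator:
    """Hypothetical static regulator: at most burst + rate * t packets
    may be admitted over any interval of length t."""

    def __init__(self, burst, rate):
        self.burst = burst          # maximum number of stored tokens (assumed parameter)
        self.rate = rate            # tokens replenished per time unit (assumed parameter)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0               # time of the previous injection attempt

    def try_inject(self, now, size=1):
        # Replenish tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        # Admit the packet only if enough tokens remain; otherwise hold it back.
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False
```

A dynamic variant in the spirit of the abstract would adjust `burst` and `rate` at run time from profiled traffic or network state; how that adjustment is made is not specified here.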

Place, publisher, year, edition, pages
IEEE Press, 2017
Keyword
Chip multi/many-core processor (CMP), fuzzy control, multi/many-processor systems-on-chip (MPSoC), network-on-chip (NoC), traffic engineering
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-204100 (URN) 10.1109/TVLSI.2016.2584781 (DOI) 000394593300015 () 2-s2.0-84979735738 (Scopus ID)
Note

QC 20170329

Available from: 2017-03-29 Created: 2017-03-29 Last updated: 2017-11-29. Bibliographically approved
Xiong, Q., Wu, F., Lu, Z. & Xie, C. (2017). Extending Real-Time Analysis for Wormhole NoCs. I.E.E.E. transactions on computers (Print), 66(9), 1532-1546, Article ID 7884964.
2017 (English). In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 66, no 9, 1532-1546 p., 7884964. Article in journal (Refereed). Published
Abstract [en]

The delay upper-bound analysis problem is of fundamental importance to real-time applications in Networks-on-Chip (NoCs). In this paper, we revisit two state-of-the-art analysis models for real-time communication in wormhole NoCs with priority-based preemptive arbitration and show that the models only support specific router architectures with large buffer sizes. We then propose an extended analysis model to estimate delay upper bounds for all router architectures and buffer sizes by identifying and analyzing the differences between upstream and downstream indirect interferences according to the relative positions of traffic flows and by taking the buffer influence into consideration. Simulated evaluations show that our model supports one more router architecture and applies to small buffer sizes compared to the previous models.

Place, publisher, year, edition, pages
IEEE Computer Society, 2017
Keyword
delay, real-time communication, Wormhole NoC, Computer architecture, Network architecture, Routers, Extended analysis, Real time analysis, Real-time application, Relative positions, Router architecture, Upper bound analysis, Network-on-chip
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-216210 (URN) 10.1109/TC.2017.2686391 (DOI) 000407449400006 () 2-s2.0-85029510717 (Scopus ID)
Note

QC 20171218

Available from: 2017-12-18 Created: 2017-12-18 Last updated: 2018-01-13. Bibliographically approved
Lu, Z. & Yao, Y. (2017). Marginal Performance: Formalizing and Quantifying Power Over/Under Provisioning in NoC DVFS. I.E.E.E. transactions on computers (Print), 66(11), 1903-1917.
2017 (English). In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 66, no 11, 1903-1917 p. Article in journal (Refereed). Published
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-216598 (URN) 10.1109/TC.2017.2715018 (DOI) 000412566600006 () 2-s2.0-85021826419 (Scopus ID)
Note

QC 20171116

Available from: 2017-11-16 Created: 2017-11-16 Last updated: 2018-01-13. Bibliographically approved
Chen, X., Lu, Z., Liu, S. & Chen, S. (2017). Round-trip DRAM Access Fairness in 3D NoC-based Many-core Systems. ACM Transactions on Embedded Computing Systems, 16, Article ID 162.
2017 (English). In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 16, 162. Article in journal (Refereed). Published
Abstract [en]

In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances, and the latency gap between DRAM accesses grows as the network size increases, which leads to unfair DRAM access performance among different nodes. This phenomenon may lead to high latencies for some DRAM accesses, which become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency differences are discussed in detail. Secondly, the DRAM access fairness is further quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip DRAM access time as the basis to prioritize the DRAM accesses in DRAM interfaces, so that the DRAM accesses with potentially high latencies can be transferred as early and as fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach can achieve fair DRAM access and outperform the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed by references [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD), and speedup. In the experiments, the maximum improvements in maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3%, respectively. Besides, our proposal brings very small extra hardware overhead (< 0.6%) in comparison to the three counterparts.

Place, publisher, year, edition, pages
Association for Computing Machinery, 2017
Keyword
3D Networks-on-Chip (NoC), DRAM access fairness, DRAM scheduling, round-trip
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-217943 (URN) 10.1145/3126561 (DOI) 000414353800045 () 2-s2.0-85030680902 (Scopus ID)
Note

QC 20171121

Available from: 2017-11-21 Created: 2017-11-21 Last updated: 2017-11-21. Bibliographically approved
Chen, X., Lu, Z., Liu, S. & Chen, S. (2017). Round-trip DRAM access fairness in 3D NoC-based many-core systems. ACM Transactions on Embedded Computing Systems, 16(5s), Article ID 162.
2017 (English). In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 16, no 5s, 162. Article in journal (Refereed). Published
Abstract [en]

In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances, and the latency gap between DRAM accesses grows as the network size increases, which leads to unfair DRAM access performance among different nodes. This phenomenon may lead to high latencies for some DRAM accesses, which become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency differences are discussed in detail. Secondly, the DRAM access fairness is further quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and use the predicted round-trip DRAM access time as the basis to prioritize the DRAM accesses in DRAM interfaces, so that the DRAM accesses with potentially high latencies can be transferred as early and as fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach can achieve fair DRAM access and outperform the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed by references [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD), and speedup. In the experiments, the maximum improvements in maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3%, respectively. Besides, our proposal brings very small extra hardware overhead (< 0.6%) in comparison to the three counterparts.

Place, publisher, year, edition, pages
Association for Computing Machinery, 2017
Keyword
3D networks-on-chip (NoC), DRAM access fairness, DRAM scheduling, Round-trip, Scheduling, 3D networks, Communication distance, First come first serves, Hardware overheads, Performance bottlenecks, Round trip, Scheduling policies, Standard deviation, Network-on-chip
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-216205 (URN) 10.1145/3126561 (DOI) 000414353800045 () 2-s2.0-85030680902 (Scopus ID)
Note

QC 20171218

Available from: 2017-12-18 Created: 2017-12-18 Last updated: 2018-01-13. Bibliographically approved
Du, G., Ma, S., Li, Z., Lu, Z., Ouyang, Y. & Gao, M. (2017). Work-in-progress: SSS: Self-aware system-on-chip using static-dynamic hybrid method. In: Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion, CASES 2017: . Paper presented at 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES 2017, Seoul, South Korea, 15 October 2017 through 20 October 2017. Association for Computing Machinery (ACM), Article ID 3125527.
2017 (English). In: Proceedings of the 2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems Companion, CASES 2017, Association for Computing Machinery (ACM), 2017, 3125527. Conference paper (Refereed)
Abstract [en]

Network on chip (NoC) has become the de facto communication standard for multi-core or many-core systems on chip (SoCs), due to its scalability and flexibility. However, temperature is an important factor in NoC design, as it affects the overall performance of the SoC by decreasing circuit frequency, increasing energy consumption, and even shortening chip lifetime. In this paper, we propose SSS, a self-aware SoC using a static-dynamic hybrid method, which combines dynamic mapping and static mapping to reduce the hot-spot temperature of NoC-based SoCs. First, we propose monitoring the thermal distribution for self-state sensing. Then, in the static mapping stage, we calculate the optimal mapping solutions under different temperature modes using a discrete firefly algorithm to help self-decision making. Finally, in the dynamic mapping stage, we achieve dynamic mapping by configuring the NoC and the SoC sentient unit for self-optimizing. Experimental results show that SSS can reduce the peak temperature by up to 30.64%. An FPGA prototype shows the effectiveness and smartness of SSS in reducing hot-spot temperature.
Keywords: Self-awareness, SoC architecture, NoC.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-219651 (URN) 10.1145/3125501.3125527 (DOI) 2-s2.0-85035354020 (Scopus ID) 9781450351843 (ISBN)
Conference
2017 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES 2017, Seoul, South Korea, 15 October 2017 through 20 October 2017
Note

QC 20171213

Available from: 2017-12-13 Created: 2017-12-13 Last updated: 2017-12-13. Bibliographically approved
Ma, N., Zou, Z., Lu, Z., Zheng, L., Huan, Y. & Blixt, S. (2016). A 101.4 GOPS/W Reconfigurable and Scalable Control-centric Embedded Processor for Domain-specific Applications. In: Proceedings - IEEE International Symposium on Circuits and Systems: . Paper presented at IEEE International Symposium on Circuit and System (ISCAS) (pp. 1746-1749). IEEE.
2016 (English). In: Proceedings - IEEE International Symposium on Circuits and Systems, IEEE, 2016, 1746-1749 p. Conference paper, Published paper (Refereed)
Abstract [en]

Increasing energy efficiency and performance while providing customizability and scalability is vital for embedded processors adapting to domain-specific applications such as the Internet of Things. In this paper, we propose a reconfigurable and scalable control-centric architecture, and implement a design consisting of two cores and an on-chip multi-mode router in 65 nm technology. The reconfigurability is enabled by the restructurable sequence mapping table (SMT) and thus the reorganizable functional units. Owing to the integration of the multi-mode router, an on-chip or inter-chip network for multi-/many-core computing can be composed for performance extension on demand, even in the post-fabrication stage. The control-centric design simplifies the control logic, shrinks the non-functional units, and orchestrates the operations to increase hardware utilization and reduce excessive data movement for high energy efficiency. As a result, the processor can conduct both general-purpose processing with 29% smaller code size and application-specific processing with over 10 times performance improvement when implementing AES by SMT. The dual-core processor consumes 19.7 μW/MHz with a die size of 3.5 mm². The achieved energy efficiency is 101.4 GOPS/W.

Place, publisher, year, edition, pages
IEEE, 2016
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-169547 (URN) 10.1109/ISCAS.2016.7538905 (DOI) 2-s2.0-84983396457 (Scopus ID) 978-147995340-0 (ISBN)
Conference
IEEE International Symposium on Circuit and System (ISCAS)
Note

QC 20160613

Available from: 2015-06-16 Created: 2015-06-16 Last updated: 2016-12-15. Bibliographically approved
Wang, J., Lu, Z. & Li, Y. (2016). A New CDMA Encoding/Decoding Method for on-Chip Communication Network. IEEE Transactions on Very Large Scale Integration (vlsi) Systems, 24(4), 1607-1611.
2016 (English). In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 24, no 4, 1607-1611 p. Article in journal (Refereed). Published
Abstract [en]

As a high-performance on-chip communication method, the code division multiple access (CDMA) technique has recently been applied to networks on chip (NoCs). We propose a new standard-basis-based encoding/decoding method to leverage the performance and cost of CDMA NoCs in area, power consumption, and network throughput. In the transmitter module, source data from different senders are separately encoded with an orthogonal code of a standard basis, and these coded data are mixed together by an XOR operation. Then, the sums of data can be transmitted to their destinations through the on-chip communication infrastructure. In the receiver module, a sequence of chips is retrieved by taking an AND operation between the sums of data and the corresponding orthogonal code. After a simple accumulation of these chips, the original data can be reconstructed. We implement our encoding/decoding method and apply it to a CDMA NoC with a star topology. Compared with the state-of-the-art Walsh-code-based (WB) encoding/decoding technique, our method achieves up to 67.46% power saving and 81.24% area saving, together with a 30%-50% decrease in encoding/decoding latency. Moreover, CDMA NoCs of different sizes applying our encoding/decoding method gain power saving, area saving, and maximal throughput improvement of up to 20.25%, 22.91%, and 103.26%, respectively, compared with the WB CDMA NoC.
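
The encode, mix, and decode steps described in this abstract can be sketched in a few lines. This is one plausible bit-level reading of the abstract, not the authors' hardware design: the function names, the one-bit-per-sender framing, and the choice of code length equal to the number of senders are assumptions made for illustration.

```python
def standard_basis_code(sender, n):
    """Orthogonal code for one sender: the sender-th standard basis vector of length n."""
    return [1 if k == sender else 0 for k in range(n)]

def encode(bit, code):
    """Spread one data bit over the chip positions of the sender's code (chip-wise AND)."""
    return [bit & c for c in code]

def mix(encoded_sequences):
    """Combine the chip sequences of all senders with a chip-wise XOR."""
    mixed = [0] * len(encoded_sequences[0])
    for seq in encoded_sequences:
        mixed = [m ^ s for m, s in zip(mixed, seq)]
    return mixed

def decode(mixed, code):
    """AND the mixed chips with the receiver's code, then accumulate the result."""
    chips = [m & c for m, c in zip(mixed, code)]
    return sum(chips)

def roundtrip(bits):
    """Send one bit per sender through encode -> mix -> decode."""
    n = len(bits)
    codes = [standard_basis_code(i, n) for i in range(n)]
    mixed = mix([encode(b, c) for b, c in zip(bits, codes)])
    return [decode(mixed, c) for c in codes]
```

Because each standard basis code is nonzero in exactly one chip position, the XOR never combines two nonzero chips in this sketch, so the accumulation in `decode` returns the sender's bit unchanged.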

Place, publisher, year, edition, pages
IEEE, 2016
Keyword
Code division multiple access (CDMA), integrated circuit (IC), network on chip (NoC)
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-185617 (URN) 10.1109/TVLSI.2015.2471077 (DOI) 000373020200039 () 2-s2.0-84941890086 (Scopus ID)
Note

QC 20160429

Available from: 2016-04-29 Created: 2016-04-25 Last updated: 2018-01-10. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0061-3475