  • 1. Candaele, Bernard
    et al.
    Aguirre, Sylvain
    Sarlotte, Michel
    Anagnostopoulos, Iraklis
    Xydis, Sotirios
    Bartzas, Alexandros
    Bekiaris, Dimitris
    Soudris, Dimitrios
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Chabloz, Jean-Michel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Vanmeerbeeck, Geert
    Kreku, Jari
    Tiensyrja, Kari
    Ieromnimon, Fragkiskos
    Kritharidis, Dimitrios
    Wiefrink, Andreas
    Vanthournout, Bart
    Martin, Philippe
    Mapping Optimisation for Scalable multi-core ARchiTecture: The MOSART approach
    2010. In: Proceedings - IEEE Annual Symposium on VLSI, ISVLSI 2010, 2010, p. 518-523
    Conference paper (Refereed)
    Abstract [en]

    The project will address two main challenges of prevailing architectures: 1) The global interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; 2) The difficulties in programming heterogeneous, multi-core platforms, in particular in dynamically managing data structures in distributed memory. MOSART aims to overcome these through a multi-core architecture with distributed memory organisation, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimised and customised together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: 1) Providing platform support for management of abstract data structures, including middleware services and a run-time data manager for NoC based communication infrastructure; 2) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.

  • 2. Candaele, Bernard
    et al.
    Aguirre, Sylvain
    Sarlotte, Michel
    Anagnostopoulos, Iraklis
    Xydis, Sotirios
    Bartzas, Alexandros
    Bekiaris, Dimitris
    Soudris, Dimitrios
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chabloz, Jean-Michel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Vanmeerbeeck, Geert
    Kreku, Jari
    Tiensyrja, Kari
    Ieromnimon, Fragkiskos
    Kritharidis, Dimitrios
    Wiefrink, Andreas
    Vanthournout, Bart
    Martin, Philippe
    The MOSART Mapping Optimization for multi-core Architectures
    2011. In: VLSI 2010 Annual Symposium, Springer Publishing Company, 2011, p. 181-195
    Conference paper (Refereed)
    Abstract [en]

    The MOSART project addresses two main challenges of prevailing architectures: (i) The global interconnect and memory bottleneck due to a single, globally shared memory with high access times and power consumption; (ii) The difficulties in programming heterogeneous, multi-core platforms. MOSART aims to overcome these through a multi-core architecture with distributed memory organization, a Network-on-Chip (NoC) communication backbone and configurable processing cores that are scaled, optimized and customized together to achieve diverse energy, performance, cost and size requirements of different classes of applications. MOSART achieves this by: (i) Providing platform support for management of abstract data structures including middleware services and a run-time data manager for NoC based communication infrastructure; (ii) Developing tool support for parallelizing and mapping applications on the multi-core target platform and customizing the processing cores for the application.

  • 3.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Area and Performance Optimization of Barrier Synchronization on Multi-core Network-on-Chips
    2010. In: 3rd IEEE International Conference on Computer and Electrical Engineering (ICCEE), 2010
    Conference paper (Refereed)
    Abstract [en]

    Barrier synchronization is commonly and widely used to synchronize the execution of parallel processor cores on multi-core Network-on-Chips (NoCs). Since its global nature may cause heavy serialization resulting in large performance penalty, barrier synchronization should be carefully designed to have low latency communication and to minimize overall completion time. Therefore, in this paper, we propose a fast barrier synchronization mechanism, targeting multi-core NoCs. The fast barrier synchronization mechanism includes a dedicated hardware module, named Fast Barrier Synchronizer (FBS), integrated with each processor node. It offers a set of barrier counters and can concurrently process synchronization requests issued by the local node and remote nodes via the on-chip network. The salient feature of our fast barrier synchronization mechanism is that, once the barrier condition is reached, the “barrier release” acknowledgement is routed to all processor nodes in a broadcast way in order to save chip area by avoiding storing source node information and to minimize completion time by avoiding serialization of barrier releasing. Synthesis results suggest that the FBS can run over 1 GHz in SMIC® 130nm technology with small area overhead. We implemented an FBS-enhanced multi-core NoC architecture on our FPGA platform using the Xilinx® Virtex 5 as the FPGA chip. FPGA utilization and simulation results show that our fast barrier synchronization demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing.

  • 4.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hybrid distributed shared memory space in multi-core processors
    2011. In: Journal of Software, ISSN 1796-217X, Vol. 6, no 12 SPEC. ISSUE, p. 2369-2378
    Article in journal (Refereed)
    Abstract [en]

    On multi-core processors, memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart.
    The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.

  • 5.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Xu, Bangjian
    Luo, Heng
    Multi-FPGA Implementation of a Network-on-Chip Based Many-core Architecture with Fast Barrier Synchronization Mechanism
    2010. In: Proceedings of the IEEE Norchip Conference, 2010
    Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose a fast barrier synchronization mechanism, targeting Network-on-Chip based many-core architectures. Its salient feature is that, once the barrier condition is reached, the "barrier release" acknowledgement is routed to all processor nodes in a broadcast way in order to save area by avoiding storing source node information and to minimize completion time by eliminating serialization of barrier releasing. Then, we construct a multi-FPGA platform using Xilinx® Virtex 5 as FPGA chips and implement a NoC based many-core architecture on it. FPGA utilization and simulation results show that our mechanism demonstrates both area and performance advantages over the barrier synchronization counterpart with unicast barrier releasing.

  • 6.
    Chen, Xiaowen
    et al.
    University of Maine, USA.
    Lawoko, Martin
    University of Maine, USA.
    van Heiningen, Adriaan
    University of Maine, USA.
    Kinetics and mechanism of autohydrolysis of hardwoods
    2010. In: Bioresource Technology, ISSN 0960-8524, E-ISSN 1873-2976, Vol. 101, no 20, p. 7812-7819
    Article in journal (Refereed)
    Abstract [en]

    Autohydrolysis using water is a promising method to extract hemicelluloses from wood prior to pulping in order to make co-products such as ethanol and acetic acid besides pulp. Many studies have been carried out on the kinetics and mechanism of autohydrolysis using batch reactors. The present study was performed in a continuous mixed flow reactor where the wood chips are retained in a basket inside the reactor. This reactor is well suited to determine intrinsic kinetics of hemicellulose dissolution because the dissolved products are rapidly removed from the reactor, thus minimizing further hydrolysis and degradation of the hemicelluloses in solution. The xylan removal rate follows an S-shaped behavior. GPC analysis of the continuously removed extract shows that the dissolved xylan oligomers have a DP smaller than about 25. Lignin-free xylan oligomers and cellulose oligomers are the major components dissolved in the initial stage of autohydrolysis, while xylan covalently bound to lignin (i.e. an LCC) is the major component removed during the later stage of autohydrolysis. The molecular weight of the dissolved components decreases with time in the second stage. The kinetics of xylan removal are explained in terms of a mechanism based on recent knowledge of the ultrastructure of the cell fibre wall.

  • 7.
    Chen, Xiaowen
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS). Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China.
    Lei, Yuanwu
    Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China.
    Lu, Zhonghai
    KTH, School of Electrical Engineering and Computer Science (EECS), Electronics, Electronic and embedded systems.
    Chen, Shuming
    Natl Univ Def Technol, Coll Comp, Changsha 410073, Hunan, Peoples R China.
    A Variable-Size FFT Hardware Accelerator Based on Matrix Transposition
    2018. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 26, no 10, p. 1953-1966
    Article in journal (Refereed)
    Abstract [en]

    Fast Fourier transform (FFT) is the kernel and the most time-consuming algorithm in the domain of digital signal processing, and the FFT sizes of different applications are very different. Therefore, this paper proposes a variable-size FFT hardware accelerator, which fully supports the IEEE-754 single-precision floating-point standard and the FFT calculation with a wide size range from 2 to 2^20 points. First, a parallel Cooley-Tukey FFT algorithm based on matrix transposition (MT) is proposed, which can efficiently divide a large size FFT into several small size FFTs that can be executed in parallel. Second, guided by this algorithm, the FFT hardware accelerator is designed, and several FFT performance optimization techniques such as hybrid twiddle factor generation, multibank data memory, block MT, and token-based task scheduling are proposed. Third, its VLSI implementation is detailed, showing that it can work at 1 GHz with the area of 2.4 mm² and the power consumption of 91.3 mW at 25 °C, 0.9 V. Finally, several experiments are carried out to evaluate the proposal's performance in terms of FFT execution time, resource utilization, and power consumption. Comparative experiments show that our FFT hardware accelerator achieves at most 18.89x speedups in comparison to two software-only solutions and two hardware-dedicated solutions.

  • 8.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Run-time Partitioning of Hybrid Distributed Shared Memory on Multi-core Network-on-Chips
    2010. In: The 3rd IEEE International Symposium on Parallel Architectures, Algorithms and Programming (PAAP 2010), 2010, p. 39-46
    Conference paper (Refereed)
    Abstract [en]

    On multi-core Network-on-Chips (NoCs), memories are preferably distributed and supporting Distributed Shared Memory (DSM) is essential for the sake of reusing huge amount of legacy code and easy programming. However, the DSM organization imports the inherent overhead of translating virtual memory addresses into physical memory addresses, resulting in negative performance. We observe that, in parallel applications, different data have different properties (private or shared). For the private data accesses, it's unnecessary to perform Virtual-to-Physical address translations. Even for the same datum, its property may be changeable in different phases of the program execution. Therefore, this paper focuses on decreasing the overhead of Virtual-to-Physical address translation and hence improving the system performance by introducing hybrid DSM organization and supporting run-time partitioning according to the data property. The hybrid DSM organization aims at supporting fast and physical memory accesses for private data and maintaining a global and single virtual memory space for shared data. Based on the data property of parallel applications, the run-time partitioning supports changing the hybrid DSM organization during the program execution. It ensures fast physical memory addressing on private data and conventional virtual memory addressing on shared data, improving the performance of the entire system by reducing virtual-to-physical address translation overhead as much as possible. We formulate the run-time partitioning of hybrid DSM organization in order to analyze its performance. A real DSM based multi-core NoC platform is also constructed. The experimental results of real applications show that the hybrid DSM organization with run-time partitioning demonstrates performance advantage over the conventional DSM counterpart.
    The percentage of performance improvement depends on problem size, way of data partitioning and computation/communication ratio of parallel applications, network size of the system, etc. In our experiments, the maximal improvement is 34.42%, the minimal improvement 3.68%.

  • 9.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Jantsch, A.
    Chen, S.
    Guo, Y.
    Chen, H.
    Performance analysis of homogeneous on-chip large-scale parallel computing architectures for data-parallel applications
    2015. In: Journal of Electrical and Computer Engineering, ISSN 2090-0147, E-ISSN 2090-0155, Vol. 2015, article id 902591
    Article in journal (Refereed)
    Abstract [en]

    On-chip computing platforms are evolving from single-core bus-based systems to many-core network-based systems, which are referred to as On-chip Large-scale Parallel Computing Architectures (OLPCs) in the paper. Homogeneous OLPCs feature strong regularity and scalability due to their identical cores and routers. Data-parallel applications have their parallel data subsets that are handled individually by the same program running in different cores. Therefore, data-parallel applications are able to obtain good speedup in homogeneous OLPCs. The paper addresses modeling the speedup performance of homogeneous OLPCs for data-parallel applications. When establishing the speedup performance model, the network communication latency and the ways of storing data of data-parallel applications are modeled and analyzed in detail. Two abstract concepts (equivalent serial packet and equivalent serial communication) are proposed to construct the network communication latency model. The uniform and hotspot traffic models are adopted to reflect the ways of storing data. Some useful suggestions are presented during the performance model's analysis. Finally, three data-parallel applications are performed on our cycle-accurate homogeneous OLPC experimental platform to validate the analytic results and demonstrate that our study provides a feasible way to estimate and evaluate the performance of data-parallel applications on homogeneous OLPCs.

  • 10.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Handling Shared Variable Synchronization in Multi-core Network-on-Chips with Distributed Memory
    2010. In: Proceedings: IEEE International SOC Conference, SOCC 2010, 2010, p. 467-472
    Conference paper (Refereed)
    Abstract [en]

    Parallelized shared variable applications running on multi-core Network-on-Chips (NoCs) require efficient support for synchronization, since communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. In this paper, we propose a dedicated hardware module for synchronization management. This module is called Synchronization Handler (SH), integrated with each processor-memory node on the multi-core NoCs. It uses two physical buffers to concurrently process synchronization requests issued by the local processor and remote processors via the on-chip network. One salient feature is that the two physical buffers are dynamically allocated to form multiple virtual buffers (a virtual buffer is related to a shared synchronization variable) so as to improve the buffer utilization and alleviate the head-of-line blocking. Synthesis results suggest that the SH can run over 900 MHz in 130nm technology with small area overhead. To justify the SH-enhanced multi-core NoCs, we employ synthetic workloads to evaluate synchronization cost and buffer utilization, and run synchronization-intensive applications to investigate speedup. The results show that our approach is viable.

  • 11.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Chen, Shuming
    Speedup Analysis of Data-parallel Applications on Multi-core NoCs
    2009. In: Proceedings of the IEEE International Conference on ASIC (ASICON), 2009, p. 105-108
    Conference paper (Refereed)
    Abstract [en]

    As more computing cores are integrated onto a single chip, the effect of network communication latency is becoming more and more significant on Multi-core Network-on-Chips (NoCs). For data-parallel applications, we study the model of parallel speedup by including network communication latency in Amdahl's law. The speedup analysis considers the effect of network topology, network size, traffic model and computation/communication ratio. We also study the speedup efficiency. In our Multi-core NoC platform, a real data-parallel application, i.e. matrix multiplication, is used to validate the analysis. Our theoretical analysis and the application results show that the speedup improvement is nonlinear and the speedup efficiency decreases as the system size is scaled up. Such analysis can be used to guide architects and programmers to improve parallel processing efficiency by reducing network latency with optimized network design and increasing computation proportion in the program.

  • 12.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Supporting Distributed Shared Memory on Multi-core Network-on-Chips Using a Dual Microcoded Controller
    2010. In: Proceedings of the conference for Design Automation and Test in Europe, 2010, p. 39-44
    Conference paper (Refereed)
    Abstract [en]

    Supporting Distributed Shared Memory (DSM) is essential for multi-core Network-on-Chips for the sake of reusing huge amount of legacy code and easy programmability. We propose a microcoded controller as a hardware module in each node to connect the core, the local memory and the network. The controller is programmable where the DSM functions such as virtual-to-physical address translation, memory access and synchronization etc. are realized using microcode. To enable concurrent processing of memory requests from the local and remote cores, our controller features two mini-processors, one dealing with requests from the local core and the other from remote cores. Synthesis results suggest that the controller consumes 51k gates for the logic and can run up to 455 MHz in 130 nm technology. To evaluate its performance, we use synthetic and application workloads. Results show that, when the system size is scaled up, the delay overhead incurred by the controller may become less significant when compared with the network delay. In this way, the delay efficiency of our DSM solution is close to hardware solutions on average but still has all the flexibility of software solutions.

  • 13.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Supporting Efficient Synchronization in Multi-core NoCs Using Dynamic Buffer Allocation Technique
    2010. In: Proceedings of the IEEE Annual Symposium on VLSI, 2010, p. 462-463
    Conference paper (Refereed)
    Abstract [en]

    This paper explores a dynamic buffer allocation technique to guide a distributed synchronization architecture to support efficient synchronization on multi-core Network-on-Chips (NoCs). The synchronization architecture features two physical buffers to be able to concurrently queue and handle synchronization requests issued by the local processor and remote processors via the on-chip network. Using the dynamic buffer allocation technique, the two physical buffers are dynamically allocated to form multiple virtual buffers in order to improve buffers' utilization. Experiments are carried out to evaluate buffers' utilization.

  • 14.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Chen, Shenggang
    Gu, Huitao
    Reducing Virtual-to-Physical address translation overhead in Distributed Shared Memory based multi-core Network-on-Chips according to data property
    2013. In: Computers & electrical engineering, ISSN 0045-7906, E-ISSN 1879-0755, Vol. 39, no 2, p. 596-612
    Article in journal (Refereed)
    Abstract [en]

    In Network-on-Chip (NoC) based multi-core platforms, Distributed Shared Memory (DSM) preferably uses virtual addressing in order to hide the physical locations of the memories. However, this incurs performance penalty due to the Virtual-to-Physical (V2P) address translation overhead for all memory accesses. Based on the data property which can be either private or shared, this paper proposes a hybrid DSM which partitions a local memory into a private and a shared part. The private part is accessed directly using physical addressing and the shared part using virtual addressing. In particular, the partitioning boundary can be configured statically at design time and dynamically at runtime. The dynamic configuration further removes the V2P address translation overhead for those data with changeable property when they become private at runtime. In the experiments with three applications (matrix multiplication, 2D FFT, and H.264/AVC encoding), compared with the conventional DSM, our techniques show performance improvement up to 37.89%.

  • 15.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. National University of Defense Technology, China.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Guo, Yang
    Liu, Hengzhu
    Cooperative communication for efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs
    2014. In: IEICE Electronics Express, ISSN 1349-2543, E-ISSN 1349-2543, Vol. 11, no 18, p. 20140542-
    Article in journal (Refereed)
    Abstract [en]

    On many-core Network-on-Chips (NoCs), communication is on the critical path of system performance and contended synchronization requests may cause large performance penalty. Different from conventional algorithm-based approaches, the paper addresses the barrier synchronization problem from the angle of optimizing its communication performance and proposes cooperative communication as a means to achieve efficient and scalable all-to-all barrier synchronization on mesh-based many-core NoCs. With the cooperative communication, routers collaborate with one another to accomplish a fast barrier synchronization task. The cooperative communication is implemented in our router at low cost. Through comparative experiments, our approach evidently exhibits high efficiency and good scalability.

  • 16.
    Chen, Xiaowen
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Shuming
    Liu, Hai
    Cooperative communication based barrier synchronization in on-chip mesh architectures
    2011. In: IEICE ELECTRON EXPR, ISSN 1349-2543, Vol. 8, no 22, p. 1856-1862
    Article in journal (Refereed)
    Abstract [en]

    We propose cooperative communication as a means to enable efficient and scalable barrier synchronization on mesh-based many-core architectures. Our approach is different from but orthogonal to conventional algorithm-based optimizations. It relies on collaborating routers to provide efficient gather and multicast communication. In conjunction with a master-slave algorithm, it exploits the mesh regularity to achieve efficiency. The gather and multicast functions have been implemented in our router. Synthesis results suggest marginal area overhead. With synthetic and benchmark experiments, we show that our approach significantly reduces synchronization completion time and increases speedup.

  • 17.
    Chen, Xiaowen
    et al.
    KTH, School of Electrical Engineering (EES).
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Centres, VinnExcellence Center for Intelligence in Paper and Packaging, iPACK. KTH, School of Information and Communication Technology (ICT), Electronics.
    Liu, S.
    Chen, S.
    Round-trip DRAM access fairness in 3D NoC-based many-core systems
    2017. In: ACM Transactions on Embedded Computing Systems, ISSN 1539-9087, E-ISSN 1558-3465, Vol. 16, no 5s, article id 162
    Article in journal (Refereed)
    Abstract [en]

    In 3D NoC-based many-core systems, DRAM accesses behave differently due to their different communication distances, and the latency gap between DRAM accesses widens as the network size increases, which leads to unfair DRAM access performance among nodes. This phenomenon may lead to high latencies for some DRAM accesses, which become the performance bottleneck of the system. The paper addresses the DRAM access fairness problem in 3D NoC-based many-core systems by narrowing the latency difference of DRAM accesses as well as reducing the maximum latency. Firstly, the latency of a round-trip DRAM access is modeled and the factors causing DRAM access latency differences are discussed in detail. Secondly, DRAM access fairness is quantitatively analyzed through experiments. Thirdly, we propose to predict the network latency of round-trip DRAM accesses and to use the predicted round-trip DRAM access time as the basis for prioritizing DRAM accesses in DRAM interfaces, so that DRAM accesses with potentially high latencies are transferred as early and as fast as possible, thus achieving fair DRAM access. Experiments with synthetic and application workloads validate that our approach achieves fair DRAM access and outperforms the traditional First-Come-First-Serve (FCFS) scheduling policy and the scheduling policies proposed in references [7] and [24] in terms of maximum latency, Latency Standard Deviation (LSD), and speedup. In the experiments, the maximum improvements in maximum latency, LSD, and speedup are 12.8%, 6.57%, and 8.3% respectively. In addition, our proposal incurs very small extra hardware overhead (<0.6%) in comparison to the three counterparts.

  • 18.
    Jantsch, Axel
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Naeem, Abdul
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Zhang, Yuang
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Penolazzi, Sandro
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Memory Architecture and Management in an NoC Platform. 2011. In: Scalable Multi-core Architectures: Design Methodologies and Tools / [ed] Axel Jantsch and Dimitrios Soudris, Springer, 2011, 1, p. 3-28. Chapter in book (Refereed)
    Abstract [en]

    The memory organization and the management of the memory space are a critical part of every NoC-based platform design. We propose a Data Management Engine (DME), a block of programmable hardware that is part of every processing element. It off-loads the processing element (CPU, DSP, etc.) by managing the memory space, memory access, and communication over the on-chip network. The DME’s main functions are virtual address translation, private and shared memory management, the cache coherence protocol, support for memory consistency models, and synchronization and protection mechanisms for shared memory communication. The DME is fully programmable and configurable, thus allowing customized support for high-level data management functions such as dynamic memory allocation and abstract data types. This chapter describes the main concepts, design, and functionality of the DME and presents case studies illustrating its usage and performance.

  • 19. Li, Yang
    et al.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. College of Computer, National University of Defense Technology, China .
    Zhao, Xiaohui
    Yang, Yong
    Liu, Hengzhu
    Round-trip latency prediction for memory access fairness in mesh-based many-core architectures. 2014. In: IEICE Electronics Express, ISSN 1349-2543, E-ISSN 1349-2543, Vol. 11, no 24, p. 20141027-. Article in journal (Refereed)
    Abstract [en]

    In mesh-based many-core architectures, processor cores and memories reside in different locations (center, corner, edge, etc.), so memory accesses behave differently due to their different communication distances. The latency difference leads to unfair memory access and to some memory accesses with very high latencies, degrading system performance. However, improving one memory access's latency can worsen the latency of another, since memory accesses contend in the network. Therefore, the goal should be memory access fairness, achieved by balancing the latencies of memory accesses while ensuring a low average latency. In the paper, we address this goal by proposing to predict the round-trip latencies of memory-access-related packets and to use the predicted round-trip latencies to prioritize the packets. A router supporting fair memory access is designed and its hardware cost is given. Experiments carried out with a variety of network sizes and packet injection rates show that our approach outperforms classic round-robin arbitration in terms of average latency and Latency Standard Deviation (LSD). In the experiments, the maximum improvements in average latency and LSD are 16% and 48% respectively.

  • 20.
    Naeem, Abdul
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Realization and Performance Comparison of Sequential and Weak Memory Consistency Models in Network-on-Chip based Multi-core Systems. 2011. In: Proceedings of 16th ACM/IEEE Asia and South Pacific Design Automation Conference (ASP-DAC) 2011, IEEE Press, 2011, p. 154-159. Conference paper (Refereed)
    Abstract [en]

    This paper studies the realization and performance comparison of the sequential and weak consistency models in network-on-chip (NoC) based distributed shared memory (DSM) multi-core systems. Memory consistency constrains the order of shared memory operations to ensure the expected behavior of multi-core systems. Both consistency models are realized in NoC based multi-core systems. The performance of the two consistency models is compared for various network sizes using regular mesh topologies and a deflection routing algorithm. The results show that weak consistency improves performance by 46.17% and 33.76% on average in the code and consistency latencies over the sequential consistency model, due to relaxation in the program order, as the system grows from a single core to 64 cores.

  • 21.
    Naeem, Abdul
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Scalability of Relaxed Consistency Models in NoC based Multicore Architectures. 2009. In: SIGARCH Computer Architecture News, ISSN 0163-5964, E-ISSN 1943-5851, Vol. 37, no 5, p. 8-15. Article in journal (Other academic)
    Abstract [en]

    This paper studies the realization of relaxed memory consistency models in network-on-chip based distributed shared memory (DSM) multi-core systems. Within DSM systems, memory consistency is a critical issue since it affects not only the performance but also the correctness of programs. We investigate the scalability of the relaxed consistency models (weak, release consistency) implemented by using transaction counters. Our experimental results compare the average and maximum code, synchronization and data latencies of the two consistency models for various network sizes with regular mesh topologies. The observed latencies rise for both consistency models as the network size grows. However, the scaling behaviors are different. With the release consistency model these latencies grow significantly slower than with the weak consistency model, due to better optimization potential by means of overlapping, reordering and program order relaxations. Release consistency improves performance by 15.6% and 26.5% on average in the code and consistency latencies over the weak consistency model for the specific application, as the system grows from a single core to 64 cores. The latency of data transactions grows 2.2 times faster on average with the weak consistency model than with the release consistency model when the system scales from a single core to 64 cores.

  • 22.
    Naeem, Abdul
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Scalability of Weak Consistency in NoC based Multicore Architectures. 2010. In: IEEE INT SYMP CIRC SYST PROC, New York: IEEE, 2010, p. 3497-3500. Conference paper (Refereed)
    Abstract [en]

    In Multicore Network-on-Chip, it is preferable to realize distributed but shared memory (DSM) in order to reuse the huge amount of legacy code and to ease programming. Within DSM systems, memory consistency is a critical issue since it affects not only performance but also the correctness of programs. In this paper, we investigate the scalability of the weak consistency model, which may be implemented using a transaction counter. The experimental results compare synchronization latencies for various network sizes, topologies and lock positions in the network. Average synchronization latency rises exponentially for mesh and torus topologies as the network size grows. However, torus improves the synchronization latency in comparison to mesh. For the mesh topology, the average synchronization latency is also slightly affected by the lock position with respect to the network center.

  • 23.
    Naeem, Abdul
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Realization and Scalability of Release and Protected Release Consistency Models in NoC based Systems. 2011. In: Proceeding of 14th Euromicro Conference on Digital System Design, 2011, Oulu: IEEE Computer Society, 2011, p. 47-54. Conference paper (Refereed)
    Abstract [en]

    This paper studies the realization and scalability of release and protected release consistency models in Network-on-Chip (NoC) based Distributed Shared Memory (DSM) multi-core systems. The protected release consistency (PRC) model is proposed as an extension of the release consistency (RC) model and provides further relaxation in the shared memory operations. The realization schemes of the RC and PRC models use a transaction counter in each node of the NoC based multi-core (McNoC) systems. Further, we study the scalability of the RC and PRC models and evaluate their performance on the McNoC platform. A configurable NoC based platform with a 2D mesh topology and a deflection routing algorithm is used in the tests. We experiment with both synthetic and application workloads. The performance of the RC and PRC models is compared using sequential consistency (SC) as the baseline. The experiments show that the average code execution time for the PRC model in an 8x8 network (64 cores) is reduced by 30.5% over SC, and by 6.5% over the RC model. Average data execution time in the 8x8 network for the PRC model is reduced by almost 37% over SC and by 8.8% over RC. The increase in area for PRC over RC is about 880 gates in the network interface (1.7%).

  • 24. Wang, Z.
    et al.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronics. National University of Defense Technology, China.
    Li, C.
    Guo, Y.
    Fairness-oriented and location-aware NUCA for many-core SoC. 2017. In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Association for Computing Machinery (ACM), 2017, article id 13. Conference paper (Refereed)
    Abstract [en]

    Non-uniform cache architecture (NUCA) is often employed to organize the last level cache (LLC) over Networks-on-Chip (NoC). However, as the network size of Systems-on-Chip (SoC) scales up, two trends are gradually emerging. First, network latency is becoming the major source of cache access latency. Second, the communication distance and latency gap between different cores is increasing. Such a gap can cause a serious network latency imbalance problem, aggravate the non-uniformity of cache access latencies, and thus worsen system performance. In this paper, we propose a novel NUCA-based scheme, named fairness-oriented and location-aware NUCA (FL-NUCA), to alleviate the network latency imbalance problem and achieve more uniform cache access. We strive to equalize network latencies, which are measured by three metrics: average latency (AL), latency standard deviation (LSD), and maximum latency (ML). In FL-NUCA, the memory-to-LLC mapping and the links are both non-uniformly distributed to better fit the network topology and traffic, thereby equalizing network latencies in two respects, i.e., non-contention latencies and contention latencies, respectively. The experimental results show that FL-NUCA can effectively improve the fairness of network latencies. Compared with the traditional static NUCA (SNUCA), in simulation with synthetic traffic, the average improvements for AL, LSD, and ML are 20.9%, 36.3%, and 35.0%, respectively. In simulation with PARSEC benchmarks, the average improvements for AL, LSD, and ML are 6.3%, 3.6%, and 11.2%, respectively.

  • 25. Wang, Z.
    et al.
    Chen, Xiaowen
    KTH, School of Electrical Engineering (EES).
    Li, C.
    Guo, Y.
    Fairness-oriented switch allocation for networks-on-chip. 2017. In: 2017 30th IEEE International System-on-Chip Conference (SOCC), IEEE Computer Society, 2017, p. 304-309. Conference paper (Refereed)
    Abstract [en]

    Networks-on-Chip (NoC) is becoming the backbone of modern chip multiprocessor (CMP) systems. However, as the number of integrated cores increases and the network size scales up, network-latency imbalance is becoming an important problem that seriously affects the performance of the network and the system. In this paper, we aim to alleviate this problem by optimizing the design of switch allocation. We propose fairness-oriented switch allocation (FOSA), a novel switch allocation strategy to achieve uniform network latencies. FOSA can improve system performance by substantially improving the balance of network latencies. We evaluate the network and system performance of FOSA with synthetic traffic and SPEC CPU2006 benchmarks in a full-system simulator. Compared with the canonical separable switch allocator (Round-Robin) and the recently proposed switch allocator (TS-Router), the experiments with benchmarks show that our approach decreases maximum latency (ML) by 45.6% and 15.1%, respectively, and latency standard deviation (LSD) by 13.8% and 3.9%, respectively. In addition, FOSA improves system throughput by 0.8% over that of TS-Router. Finally, we synthesize FOSA and evaluate its additional area and power consumption.
