  • 51.
    Hu, Wenmin
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Liu, Hengzhu
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    TPSS: A flexible hardware support for unicast and multicast on networks-on-chip. 2012. In: Journal of Computers, ISSN 1796-203X, E-ISSN 1796-203X, Vol. 7, no 7, p. 1743-1752. Article in journal (Refereed)
    Abstract [en]

    Multicast is an important traffic mode on multi-core systems, and efficient hardware support for multicast can greatly improve the performance of the whole system. Most multicast solutions use dimension-order routing to generate the multicast trees, which are neither bandwidth nor power efficient. This article presents a synthesizable router for network-on-chip (NoC) which supports arbitrarily shaped multicast paths based on a mesh topology. In our scheme, incremental setup is adopted to simplify the process of multicast tree construction. For each sub-path setup, we present a novel scheme called two period sub-path setup (TPSS). TPSS is divided into two periods: routing to a predetermined intermediate router, and updating lookup tables from the intermediate router to the destination. This novel setup makes it feasible to support arbitrarily shaped path setup. In our case study, the Optimized tree algorithm (OPT) and the Left-XY-Right-Optimized tree algorithm (LXYROPT) are proposed for power-efficient path searching, but they need to be pre-configured because of their high computation cost. Moreover, Virtual Circuit Tree Multicasting (VCTM) is also supported in our scheme for dynamic construction of multicast paths, which needs no computation in path searching. The performance is evaluated using a cycle-accurate simulator developed in SystemC, and the hardware overhead is estimated using a synthesizable HDL model. Compared to VCTM (without FIFO, multicast table and network adapter), the area overhead of implementing our router is negligible (less than 0.5%).

  • 52.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Output Process of Variable Bit-Rate Flows in On-Chip Networks Based on Aggregate Scheduling. 2011. In: Proceedings of the International Conference on Computer Design, 2011, p. 445-446. Conference paper (Refereed)
    Abstract [en]

    In NoCs, several flows are often merged into one aggregate flow due to heavy resource sharing. To strengthen formal performance analysis, we propose an improved model for the output flow of a FIFO multiplexer under aggregate scheduling. The model of the aggregate flow is formally proven and can serve as the basis for a stringent worst-case delay and buffer analysis.

  • 53. Jafari, Fahimeh
    et al.
    Jantsch, Axel
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Weighted Round Robin Configuration for Worst-Case Delay Optimization in Network-on-Chip. 2016. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 24, no 12, p. 3387-3400. Article in journal (Refereed)
    Abstract [en]

    We propose an approach for computing the end-to-end delay bound of individual variable bit-rate flows in a First-In First-Out (FIFO) multiplexer with aggregate scheduling under a weighted round robin (WRR) policy. To this end, we use network calculus to derive per-flow end-to-end equivalent service curves, which are employed for computing least upper delay bounds (LUDBs) of the individual flows. Since real-time applications need guaranteed services with low delay bounds, we optimize the weights in the WRR policy to minimize the LUDBs while satisfying the performance constraints. We formulate two constrained delay optimization problems, namely, minimize-delay and multiobjective optimization. Multiobjective optimization has both the total delay bounds and their variance as the minimization objectives. The proposed optimizations are solved using a genetic algorithm. A video object plane decoder case study exhibits a 15.4% reduction of the total worst-case delays and a 40.3% reduction in the variance of delays when compared with a round robin policy. The optimization algorithm has low run-time complexity, enabling quick exploration of large design spaces. We conclude that an appropriate weight allocation can be a valuable instrument for delay optimization in on-chip network designs.

  • 55.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Weighted Round Robin Configuration for Worst-Case Delay Optimization in Network-on-Chip. Manuscript (preprint) (Other academic)
  • 56.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Worst-Case Delay Analysis of Variable Bit-Rate Flows in Network-on-Chip with Aggregate Scheduling. 2012. In: Proceedings of the Design and Test in Europe Conference (DATE), 2012, p. 538-541. Conference paper (Refereed)
    Abstract [en]

    Aggregate scheduling in routers merges several flows into one aggregate flow. We propose an approach for computing the end-to-end delay bound of individual flows in a FIFO multiplexer under aggregate scheduling. A synthetic case study shows that the end-to-end delay bound is up to 33.6% tighter than in the case without considering the traffic peak behavior.

  • 57.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels. In: ACM Transactions on Design Automation of Electronic Systems, ISSN 1084-4309, E-ISSN 1557-7309. Article in journal (Refereed)
  • 58.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    Least Upper Delay Bound for VBR Flows in Networks-on-Chip with Virtual Channels. 2015. In: ACM Transactions on Design Automation of Electronic Systems, ISSN 1084-4309, E-ISSN 1557-7309, Vol. 20, no 3, article id 35. Article in journal (Refereed)
    Abstract [en]

    Real-time applications such as multimedia and gaming require stringent performance guarantees, usually enforced by a tight upper bound on the maximum end-to-end delay. For FIFO multiplexed on-chip packet switched networks we consider worst-case delay bounds for Variable Bit-Rate (VBR) flows with aggregate scheduling, which schedules multiple flows as an aggregate flow. VBR Flows are characterized by a maximum transfer size (L), peak rate (p), burstiness (sigma), and average sustainable rate (rho). Based on network calculus, we present and prove theorems to derive per-flow end-to-end Equivalent Service Curves (ESC), which are in turn used for computing Least Upper Delay Bounds (LUDBs) of individual flows. In a realistic case study we find that the end-to-end delay bound is up to 46.9% more accurate than the case without considering the traffic peak behavior. Likewise, results also show similar improvements for synthetic traffic patterns. The proposed methodology is implemented in C++ and has low run-time complexity, enabling quick evaluation for large and complex SoCs.
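
    Illustrative sketch (not taken from the paper): in textbook network calculus, a flow described by the four parameters named in the abstract above is commonly bounded by the arrival curve min(L + p*t, sigma + rho*t). The Python below assumes that standard model plus a hypothetical rate-latency server with rate R and latency T. The function names are invented for illustration, and the paper's own theorems for aggregate scheduling are more involved.

```python
def vbr_arrival_curve(t, L, p, sigma, rho):
    """Upper bound on the traffic a VBR flow may send in any window of length t."""
    if t <= 0:
        return 0.0
    # Peak-rate segment (L + p*t) limits short windows,
    # average-rate segment (sigma + rho*t) limits long ones.
    return min(L + p * t, sigma + rho * t)


def delay_bound(L, p, sigma, rho, R, T):
    """Delay bound through an assumed rate-latency server beta(t) = R * max(0, t - T).

    Assumes sigma >= L, p > rho and rho <= R (otherwise the delay is unbounded).
    """
    if rho > R:
        raise ValueError("average rate exceeds service rate; delay is unbounded")
    # Time at which the peak-rate and average-rate segments intersect.
    theta = (sigma - L) / (p - rho)
    # Horizontal deviation between the arrival curve and the service curve.
    return T + (L + theta * max(p - R, 0.0)) / R
```

    With L=1, p=1, sigma=4, rho=0.25, R=0.5 and T=2, this bound is 8 time units, compared with T + sigma/R = 10 when the peak rate is ignored, which illustrates why modelling the peak behavior can tighten the bound.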

  • 59.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. Ferdowsi University of Mashhad, Iran.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Yaghmaee, Mohammad H.
    Ferdowsi Univ Mashhad, Dept Comp, Fac Engn, Mashhad, Iran.
    Optimal Regulation of Traffic Flows in Networks-on-Chip. 2010. In: Proceedings of the Design Automation and Test Europe Conference (DATE), IEEE Computer Society, 2010, p. 1621-1624. Conference paper (Refereed)
    Abstract [en]

    We have proposed (σ, ρ)-based flow regulation to reduce delay and backlog bounds in SoC architectures, where σ bounds the traffic burstiness and ρ the traffic rate. The regulation is conducted per-flow for its peak rate and traffic burstiness. In this paper, we optimize these regulation parameters in networks on chips where many flows may have conflicting regulation requirements. We formulate an optimization problem for minimizing total buffers under performance constraints. We solve the problem with the interior point method. Our case study results exhibit 48% reduction of total buffers and 16% reduction of total latency for the proposed problem. The optimization solution has low run-time complexity, enabling quick exploration of large design space.
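
    Illustrative sketch (not the paper's optimization): a (σ, ρ) regulator is often realized as a token bucket that releases a packet only when enough tokens have accumulated. The Python below is a generic token-bucket shaper under that assumption. The function name `regulate` and the input format are hypothetical, and the sketch regulates a single flow without the peak-rate component mentioned in the abstract.

```python
def regulate(arrivals, sigma, rho):
    """Return release times so that the output conforms to the curve sigma + rho * t.

    arrivals: list of (time, size) pairs sorted by time, each size <= sigma.
    The bucket holds at most sigma tokens and refills at rate rho.
    """
    tokens = sigma  # bucket starts full
    clock = 0.0     # time of the last token update
    releases = []
    for arrival_time, size in arrivals:
        if size > sigma:
            raise ValueError("packet larger than the burst allowance sigma")
        # FIFO order: a packet cannot leave before the previous one.
        start = max(arrival_time, clock)
        tokens = min(sigma, tokens + rho * (start - clock))
        if tokens < size:
            # Delay the packet until the bucket holds exactly enough tokens.
            start += (size - tokens) / rho
            tokens = size
        tokens -= size
        clock = start
        releases.append(start)
    return releases
```

    For example, with sigma=2 and rho=1, the packets (0, 2), (0, 1), (5, 1) are released at times 0, 1 and 5: the second packet waits one time unit for a token, while the third finds the bucket refilled.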

  • 60.
    Jafari, Fahimeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Yaghmaee, Mohammad Hossein
    Buffer Optimization in Network-on-Chip Through Flow Regulation. 2010. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 29, no 12, p. 1973-1986. Article in journal (Refereed)
    Abstract [en]

    For network-on-chip (NoC) designs, optimizing buffers is an essential task since buffers are a major source of cost and power consumption. This paper proposes flow regulation and has defined a regulation spectrum as a means for system-on-chip architects to control delay and backlog bounds. The regulation is performed per flow for its peak rate and burstiness. However, many flows may have conflicting regulation requirements due to interferences with each other. Based on the regulation spectrum, this paper optimizes the regulation parameters aiming for buffer optimization. Three timing-constrained buffer optimization problems are formulated, namely, buffer size minimization, buffer variance minimization, and multiobjective optimization, which has both buffer size and variance as minimization objectives. Minimizing buffer variance is also important because it affects the modularity of routers and network interfaces. A realistic case study exhibits 62.8% reduction of total buffers, 84.3% reduction of total latency, and 94.4% reduction on the sum of variances of buffers. Likewise, the experimental results demonstrate similar improvements in the case of synthetic traffic patterns. The optimization algorithm has low run-time complexity, enabling quick exploration of large design spaces. This paper concludes that optimal flow regulation can be a highly valuable instrument for buffer optimization in NoC designs.

  • 61.
    Jantsch, Axel
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Xiaowen
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Naeem, Abdul
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Zhang, Yuang
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Penolazzi, Sandro
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Memory Architecture and Management in an NoC Platform. 2011. In: Scalable Multi-core Architectures: Design Methodologies and Tools / [ed] Axel Jantsch and Dimitrios Soudris, Springer, 2011, 1, p. 3-28. Chapter in book (Refereed)
    Abstract [en]

    The memory organization and the management of the memory space is a critical part of every NoC based platform design. We propose a Data Management Engine (DME), that is a block of programmable hardware and part of every processing element. It off-loads the processing element (CPU, DSP, etc.) by managing the memory space, memory access and the communication over the on-chip network. The DME’s main functions are virtual address translation, private and shared memory management, cache coherence protocol, support for memory consistency models, synchronization and protection mechanisms for shared memory communication. The DME is fully programmable and configurable thus allowing for customized support for high level data management functions such as dynamic memory allocation and abstract data types. This chapter describes the main concepts, design and functionality of the DME and presents case studies illustrating its usage and performance.

  • 62.
    Jantsch, Axel
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Resource Allocation for QoS On-Chip Communication. 2009. In: Networks-on-Chips: Theory and Practice / [ed] Fayez Gebali; Haytham Elmiligi; Mohamed Watheq El-Kharashi, CRC Press, 2009. Chapter in book (Refereed)
  • 63. Jiang, X.
    et al.
    Li, D.
    Xiao, P.
    Liu, S.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lessons of IOT effects on backbone networks learnt from traffic characteristics. 2012. In: Int. Conf. Wirel. Commun., Networking Mob. Comput., WiCOM, 2012. Conference paper (Refereed)
    Abstract [en]

    Emerging and blooming internet of things (IOT) technology greatly affects backbone network development. The effects are not accurately tested or predicted yet. As a trial, some traffic characteristics profiling cases are investigated based on a gateway in IOT (IOTGW), a classical device, for data communication between IOT and backbone networks. From this investigation, some lessons of IOT effects on backbone networks are learnt by taking application, network size, data volume, etc. into account. To our knowledge, this study is the first trial to investigate the IOT effects on backbone networks from the traffic perspective.

  • 64. Jiang, X.
    et al.
    Nie, S.
    Luo, J.
    Li, D.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    An enhanced IOT gateway in a broadcast system. 2012. In: Proceedings - IEEE 9th International Conference on Ubiquitous Intelligence and Computing and IEEE 9th International Conference on Autonomic and Trusted Computing, UIC-ATC 2012, IEEE, 2012, p. 746-751. Conference paper (Refereed)
    Abstract [en]

    Internet of things (IOT) technology increasingly affects backbone network development. Aiming to reduce or weaken such effects on backbone networks from a traffic perspective and attain no discount on functions in a CobraNet based digital broadcast system, an enhanced IOT gateway and its evaluation system are presented. Based on the prototype evaluation system, experiments demonstrate the IOT gateway not only inherits the conventional functions previously often carried out by a PC platform or a server, but also develops additional abilities to cut down traffic under some cases and provides portable management for the system at the same time. According to the experimental results, it is possible from a traffic perspective to design an enhanced IOT gateway to provide value added service such as file transfer, message function, etc. in addition to high quality audio service by taking advantage of the verified sub-linear information addition mechanism. Such enhanced gateway can be properly adapted to other kinds of applications according to the nondemanding system requirements of the prototype evaluation system and proposed IOTGW design.

  • 65. Liu, M.
    et al.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, W.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A survey of FPGA dynamic reconfiguration design methodology and applications. 2012. In: International Journal of Embedded and Real-Time Communication Systems, ISSN 1947-3176, Vol. 3, no 2, p. 23-39. Article, review/survey (Refereed)
    Abstract [en]

    FPGA Dynamic Partial Reconfiguration (DPR or PR) technology has emerged and become gradually mature in the recent years. It provides the Time-Division Multiplexing (TDM) capability in utilizing on-chip resources and leads to significant benefits in comparison with conventional static designs. However, the partially reconfigurable design process features additional complexity and technical requirements to the FPGA developers. Hence, PR design approaches are being widely explored and investigated to systematize the development methodology and ease the designers. In this paper, the authors collect several research and engineering projects in this area and present a survey of the design methodology and applications of PR. Research aspects are discussed in various hardware/software layers.

  • 66.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jin, Dapeng
    Kopp, Andreas
    Kuehn, Wolfgang
    Lang, Johannes
    Li, Lu
    Lange, Soeren
    Liu, Zhen’an
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Muenchow, David
    Pechenov, Vladimir
    Roskoss, Johannes
    Spataro, Stephano
    Wang, Qiang
    Xu, Hao
    Trigger algorithm development on FPGA-based Compute Nodes. 2009. In: 2009 16th IEEE-NPSS Real Time Conference, New York: IEEE, 2009, p. 478-484. Conference paper (Refereed)
    Abstract [en]

    Based on the ATCA computation architecture and Compute Nodes (CN), investigation and implementation work has been carried out for HADES and PANDA trigger algorithms. We present our designs for HADES track reconstruction processing, Cherenkov ring recognition, Time-Of-Flight processing, electromagnetic shower recognition, and the PANDA straw tube tracking algorithm. They will appear as co-processors in the uniform system design to undertake the detector-specific computing. The algorithm principles are explained and the hardware designs are described in the paper. The current progress reveals the feasibility of implementing these algorithms on FPGAs. Experimental results also demonstrate the performance speedup when compared to alternative software solutions, as well as the potential capability of high-speed parallel/pipelined processing in Data Acquisition and Trigger systems.

  • 67. Liu, Ming
    et al.
    Kuehn, Wolfgang
    Lange, Soeren
    Yang, Shuo
    Roskoss, Johannes
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Wang, Qiang
    Xu, Hao
    Jin, Dapeng
    Liu, Zhen'an
    A High-End Reconfigurable Computation Platform for Nuclear and Particle Physics Experiments. 2011. In: Computing in science & engineering (Print), ISSN 1521-9615, E-ISSN 1558-366X, Vol. 13, no 2, p. 52-63. Article in journal (Refereed)
    Abstract [en]

    A high-performance computation platform based on field-programmable gate arrays targets nuclear and particle physics experiment applications. The system can be constructed or scaled into a supercomputer-equivalent size for detector data processing by inserting compute nodes into advanced telecommunications computing architecture (ATCA) crates. Among the case study results are that one ATCA crate can provide a computation capability equivalent to hundreds of commodity PCs for Hades online particle track reconstruction and Cherenkov ring recognition.

  • 68.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Kuehn, Wolfgang
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Run-time Partial Reconfiguration Speed Investigation and Architectural Design Space Exploration. 2009. In: FPL 09: 19th International Conference on Field Programmable Logic and Applications / [ed] Danek, M; Kadlec, J, 2009, p. 498-502. Conference paper (Refereed)
    Abstract [en]

    Run-time Partial Reconfiguration (PR) speed is significant in applications especially when fast IP core switching is required. In this paper, we propose to use Direct Memory Access (DMA), Master (MST) burst, and a dedicated Block RAM (BRAM) cache respectively to reduce the reconfiguration time. Based on the Xilinx PR technology and the Internal Configuration Access Port (ICAP) primitive in the FPGA fabric, we discuss multiple design architectures and thoroughly investigate their performance with measurements for different partial bitstream sizes. Compared to the reference OPB_HWICAP and XPS_HWICAP designs, experimental results show that DMA_HWICAP and MST_HWICAP reduce the reconfiguration time by one order of magnitude, with little resource consumption overhead. The BRAM_HWICAP design can even approach the reconfiguration speed limit of the ICAP primitive at the cost of large Block RAM utilization.

  • 69.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Kuehn, Wolfgang
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    System-on-an-FPGA Design for Real-time Particle Track Recognition and Reconstruction in Physics Experiments. 2008. In: 11th Euromicro Conference on Digital System Design - Architectures, Methods and Tools: DSD 2008, Proceedings / [ed] Fanucci, L., Los Alamitos: IEEE Computer Soc, 2008, p. 599-605. Conference paper (Refereed)
    Abstract [en]

    In particle physics experiments, the momenta of charged particles are studied by observing their deflection in a magnetic field. Dedicated detectors measure the particle tracks and complex algorithms are required for track recognition and reconstruction. This CPU-intensive task is usually implemented as off-line software running on PC clusters. In this paper we present a system-on-chip design for the track recognition and reconstruction based on modern FPGA technologies. The basic principle of the algorithm is polled from software into the FPGA fabric. The fundamental architecture of the tracking processor is described in detail. Working as processing engines in compute nodes, the tracking processor contributes to recognize potential track candidates in real-time and promotes the selection efficiency of the data acquisition and trigger system. Our design study shows that the tracking module can be integrated in a single Xilinx Virtex-4 FX60 FPGA. The processing capability of the design is about 16.7K sub-events per second per module with our experimental setup, which achieves 20 times speedup compared to the software implementation.

  • 70.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Kuehn, Wolfgang
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Yang, Shuo
    Perez, Tiago
    Liu, Zhenan
    Hardware/Software co-design of a general-purpose computation platform in particle physics. 2007. In: ICFPT 2007: International Conference On Field-Programmable Technology, Proceedings / [ed] Amano, H; Ye, A; Ikenaga, T, 2007, p. 177-183. Conference paper (Refereed)
    Abstract [en]

    In this paper we present a hardware/software co-design based computation platform for online data processing in particle physics experiments. Our goal is to ease and accelerate the development and make it universal and scalable for multiple applications, on the premise of guaranteeing high communicating and processing capabilities. The entire computation network consists of quite a few interconnected compute nodes, each of which has multiple FPGAs to implement specific algorithms for data processing. High-speed communication features including RocketIO multi-gigabit transceiver and Gigabit Ethernet are supported by FPGAs to construct internal and external connections. An embedded Linux operating system is fitted on the PowerPC CPU core inside the Xilinx Virtex-4 FX FPGA. Thus programmers can access hardware resources via device drivers and write application programs to manage the system from the high level. Furthermore measurements have been executed using the development board to investigate both communicating and processing performances of the system. Results show that the computation platform is able to communicate at a UDP/IP data rate of around 400 Mbps per Ethernet link, and the event selection engine could process an event rate of 25%.

  • 71.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Lang, Johannes
    Yang, Shuo
    Perez, Tiago
    Kuehn, Wolfgang
    Xu, Hao
    Jin, Dapeng
    Wang, Qiang
    Li, Lu
    Liu, Zhen’An
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    ATCA-based Computation Platform for Data Acquisition and Triggering in Particle Physics Experiments. 2008. In: 2008 International Conference on Field Programmable and Logic Applications, Vols 1 and 2, 2008, p. 287-292. Conference paper (Refereed)
    Abstract [en]

    An ATCA-based computation platform for data acquisition and trigger applications in nuclear and particle physics experiments has been developed. Each Compute Node (CN) which appears as a Field Replaceable Unit (FRU) in an ATCA shelf, features 5 Xilinx Virtex-4 FX60 FPGAs and up to 10 GBytes DDR2 memory. Connectivity is provided with 8 optical links and 5 Gigabit Ethernet ports, which are mounted on each board to receive data from detectors and forward results to outer shelves or PC farms with attached mass storage. Fast point-to-point on-board interconnections between FPGAs as well as the full-mesh shelf backplane provide flexibility and high bandwidth to partition algorithms and correlate results among them. The system represents a highly reconfigurable and scalable solution for multiple applications.

  • 72.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, Wolfgang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Adaptively Reconfigurable Controller for the Flash Memory. 2011. In: Flash Memories, InTech, 2011. Chapter in book (Refereed)
  • 73.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, Wolfgang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    FPGA-based adaptive computing for correlated multi-stream processing2010In: Proceedings -Design, Automation and Test in Europe, DATE, IEEE Computer Society, 2010, p. 973-976Conference paper (Refereed)
    Abstract [en]

    In conventional static implementations for correlated streaming applications, computing resources may be inefficiently utilized since multiple stream processors may supply their sub-results at asynchronous rates for result correlation or synchronization. To enhance the resource utilization efficiency, we analyze multi-streaming models and implement an adaptive architecture based on FPGA Partial Reconfiguration (PR) technology. The adaptive system can intelligently schedule and manage various processing modules during run-time. Experimental results demonstrate up to 78.2% improvement in throughput-per-unit-area on unbalanced processing of correlated streams, as well as only 0.3% context switching overhead in the overall processing time in the worst case.

  • 74.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, Wolfgang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    FPGA-based Cherenkov Ring Recognition in Nuclear and Particle Physics Experiments2011In: Reconfigurable Computing: Architectures, Tools And Applications / [ed] Koch, A; Krishnamurthy, R; McAllister, J; Woods, R; ElGhazawi, T, Springer, 2011, p. 169-180Conference paper (Refereed)
    Abstract [en]

    Cherenkov rings are often used to identify particles flying through the detector systems in nuclear and particle physics experiments. In this paper, we introduce an improved ring recognition algorithm and present its FPGA implementation. Compared to the previous implementation based on VMEBus and FPGAs, our design is evaluated to be faster by a factor of several tens up to a hundred, with acceptable resource utilization on a Xilinx Virtex-4 FX60 FPGA. The design module will reside in the online data acquisition (DAQ) and trigger facilities, and will help to significantly reduce the data rate of storage for offline analysis by retaining only interesting events and dropping the noise. Our customized FPGA cluster in one ATCA [1] shelf is foreseen to achieve a computation capability equivalent to up to thousands of commodity PCs for particle recognition.

  • 75.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, Wolfgang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    FPGA-Based Particle Recognition in the HADES Experiment2011In: IEEE Design & Test of Computers, ISSN 0740-7475, E-ISSN 1558-1918, Vol. 28, no 4, p. 48-57Article in journal (Refereed)
  • 76.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, Wolfgang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Inter-process communication using pipes in FPGA-based adaptive computing2010In: Proceedings - IEEE Annual Symposium on VLSI, ISVLSI 2010, 2010, p. 80-85Conference paper (Refereed)
    Abstract [en]

    In FPGA-based adaptive computing, Inter-Process Communications (IPC) are required to exchange information among hardware processes which time-multiplex the resources in the same reconfigurable region. In this paper, we use pipes for IPC and analyze the performance in terms of throughput, throughput efficiency and latency in switching contexts. We also present two practical implementations using FPGA BRAM and external DDR memory. Experimental results expose the key role that context switching plays in determining the IPC performance at various pipe sizes and data rates.

  • 77.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kuehn, Wolfgang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Reducing FPGA Reconfiguration Time Overhead using Virtual Configurations2010In: Proceedings of the 5th International Workshop on Reconfigurable Communication Centric Systems-on-Chip, 2010, p. 149-152Conference paper (Refereed)
    Abstract [en]

    Reconfiguration time overhead is a critical factor in determining the system performance of FPGA dynamically reconfigurable designs. To reduce the reconfiguration overhead, the most straightforward way is to increase the reconfiguration throughput, as many previous contributions did. In addition to shortening FPGA reconfiguration time, we introduce a new concept of Virtual ConFigurations (VCF) in this paper, hiding dynamic reconfiguration time in the background to reduce the overhead. Experimental results demonstrate up to 29.9% throughput enhancement by adopting two VCFs in a consumer-reconfigurable design. The packet latency performance is also largely improved by extending the channel saturation to a higher packet injection rate.

  • 78.
    Liu, Ming
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Kuehn, Wolfgang
    Yang, Shuo
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    A Reconfigurable Design Framework for FPGA Adaptive Computing2009In: 2009 INTERNATIONAL CONFERENCE ON RECONFIGURABLE COMPUTING AND FPGAS, IEEE , 2009, p. 439-444Conference paper (Refereed)
    Abstract [en]

    Partial Reconfiguration (PR) offers the possibility to adaptively change part of the FPGA design without stopping the remaining system. In this paper, we present a comprehensive framework for adaptive computing that addresses the key design points of hardware processes, system interconnections, Operating Systems (OS), device drivers, scheduler software and context switching in their respective hardware/software layers. A case study demonstrates swapping a Flash memory controller and an SRAM controller in response to diverse memory access needs. Result analysis reveals a more efficient resource utilization of 52.1% I/O pads, 86.5% LUTs and 81.3% Flip-Flops, when compared to the static design with the same functionality. A small reconfiguration overhead for context switching is measured, ranging from hundreds of microseconds to milliseconds. Moreover, technical perspectives are analyzed, and the proposed design framework is expected to bring great benefits to target applications in particle physics experiments.

  • 79.
    Liu, Shaoteng
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Jantsch, Axel
    TU Wien, Vienna, Austria.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    MultiCS: Circuit switched NoC with multiple sub-networks and sub-channels2015In: Journal of systems architecture, ISSN 1383-7621, E-ISSN 1873-6165Article in journal (Refereed)
    Abstract [en]

    We propose a multi-channel and multi-network circuit switched NoC (MultiCS) with a probe searching setup method to explore different channel partitioning and configuration policies. Our design has a variable number of channels which can be configured either as sub-channels (spatial division multiplexing channels) or sub-networks. Packets can be delivered on an established connection with one or multiple channels. An adaptive channel allocation scheme, which determines a connection width according to the dynamic use of channels, can greatly reduce the delay compared to a deterministic allocation scheme. However, the latter can offer the exact connection width requested. The benefits and burden of using different numbers of channels and configurations are studied by analysis and experiments. Our experimental results show that sub-network configurations are superior to sub-channel configurations in delay and throughput, when working at the highest clock frequency of each configuration. Under reasonable channel partitioning, sub-networks with narrow channels can generally achieve higher throughput than a network using single wide channels.

  • 80.
    Liu, Shaoteng
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A Fair and Maximal Allocator for Single-Cycle On-Chip Homogeneous Resource Allocation2014In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 23, no 10, p. 2229-2233Article in journal (Refereed)
    Abstract [en]

    Traditional allocators for network-on-chip (NoC) routers suffer from either poor-matching quality or limited fairness. We propose a waterfall (WTF) allocator targeting homogeneous resource allocation, which provides single-cycle maximal matching while guaranteeing strong fairness based on the round-robin principle. It can be implemented with a loop-free structure. In 90 nm technology, the allocator operates at about 1 GHz clock frequency. We compare WTF with wave-front, separable-input-first, and separable-output-first allocators and find that it is at least 10% smaller, has 50% less delay under high load, and uses 3% less power than any of these alternatives. Also, WTF is at least as fair or clearly fairer. We also find that in a 4 x 4 circuit switched NoC the use of WTF gives up to 20% higher network performance.

  • 81.
    Liu, Shaoteng
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Analysis and evaluation of circuit switched NoC and packet switched NoC2013In: Proceedings - 16th Euromicro Conference on Digital System Design, DSD 2013, IEEE , 2013, p. 21-28Conference paper (Refereed)
    Abstract [en]

    Circuit switched NoC has, compared to packet switching, a longer setup time, guaranteed throughput and latency, higher clock frequency, lower HW complexity, and higher energy efficiency. Depending on packet size and throughput requirements, the two exhibit better or worse performance. In this paper we design a circuit switched NoC and compare it with a packet switched NoC. By speculation and analysis, we propose that, as packet size increases, performance decreases for packet switched NoC, while it increases for circuit switched NoC. By close examination of the router architecture, we suggest that circuit switched NoC can operate at a higher clock frequency than packet switched NoC, and thus at zero load above a certain packet size circuit switched NoC could be better than packet switched NoC in packet delay. Experimental results support our intuitions and analysis. We find that the cross-over point, above which circuit switching has lower latency, is around 30 flits/packet under low load and 60-70 flits/packet under high network load.

  • 82.
    Liu, Shaoteng
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Parallel probe based dynamic connection setup in TDM NoCs2014Conference paper (Refereed)
    Abstract [en]

    We propose a Time-Division Multiplexing (TDM) based connection oriented NoC with a novel double time-wheel router architecture combined with a run-time parallel probing setup method. In comparison with traditional TDM connection setup methods, our design has the following advantages: (1) it allocates paths and time slots at run-time; (2) it is fast with predictable and bounded setup latency; (3) it avoids additional resources (no auxiliary network or central processor to find and manage connections); (4) it is fully distributed and therefore it scales nicely with network size. Compared to a packet based setup method, our probe based design can reduce path setup delay by 81% and increase network load by 110% in an 8×8 mesh, while avoiding the auxiliary network. Compared to a centralized method, our solution can double the success rate, while eliminating the central resource for path setup and reducing the wire overhead. Synthesis results suggest that our design is faster and smaller than all comparable solutions.

  • 83.
    Liu, Shaoteng
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Parallel Probing: Dynamic and constant time setup procedure in circuit switching NoC2012In: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2012, IEEE Computer Society, 2012, p. 1289-1294Conference paper (Refereed)
    Abstract [en]

    We propose a circuit switching Network-on-chip with a parallel probe searching setup method, which can search the entire network in constant time, dependent only on the network size and independent of the network load. Under a specific search policy, the setup procedure is guaranteed to terminate within 3D+6 cycles, where D is the geometric distance between source and destination. If a path can be found, the method succeeds in 3D+6 cycles; if a path cannot be found, it fails in at most 3D+6 cycles. Compared to previous work, our method can reduce the setup time and enhance the success rate of setups. Our experiments show that compared with a sequential probe searching method, this method can reduce the search time by up to 20%. Compared with a centralized channel allocator method, this method can enhance the success rate by up to 20%.
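
    The 3D+6 bound quoted in this abstract can be illustrated with a small calculation. The sketch below assumes that the "geometric distance" D is the Manhattan hop distance between two routers in a 2D mesh; that reading, and the function names, are assumptions made for illustration and are not taken from the paper.

```python
def manhattan_distance(src, dst):
    """Hop distance between two (x, y) mesh coordinates (assumed meaning of D)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def setup_bound_cycles(src, dst):
    """Upper bound on setup time in cycles, using the 3D+6 figure from the abstract."""
    d = manhattan_distance(src, dst)
    return 3 * d + 6


# Corner-to-corner setup in an 8x8 mesh: D = 7 + 7 = 14 hops.
print(setup_bound_cycles((0, 0), (7, 7)))
```

    Under this assumption the bound grows linearly with the hop distance and does not depend on the traffic load.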

  • 84.
    Liu, Shaoteng
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Jantsch, Axel
    Vienna University of Technology, Austria.
    Highway in TDM NoCs2015In: Proceedings of the Ninth ACM/IEEE International Symposium on Networks-on-Chip (NoCS'15), ACM Digital Library, 2015Conference paper (Refereed)
    Abstract [en]

    TDM (Time Division Multiplexing) is a well-known technique to provide QoS guarantees in NoCs. However, unused time slots commonly exist in TDM NoCs. In the paper, we propose a TDM highway technique which can enhance the slot utilization of TDM NoCs. A TDM highway is an express TDM connection composed of special buffer queues, called highway channels (HWCs). It can enhance the throughput and reduce data transfer delay of the connection, while keeping the quality of service (QoS) guarantee on minimum bandwidth and in-order packet delivery. We have developed a dynamic and repetitive highway setup policy which has no dependency on particular TDM NoC techniques and no overhead on traffic flows. As a result, highways can be efficiently established and utilized in various TDM NoCs.

    According to our experiments, compared to a traditional TDM NoC, adding one HWC with two buffers to every input port of routers in an 8×8 mesh can reduce data delay by up to 80% and increase the maximum throughput by up to 310%. More improvements can be achieved by adding more HWCs per input per router, or more buffers per HWC. We also use a set of MPSoC application benchmarks to evaluate our highway technique. The experimental results suggest that with highways, we can reduce application run time by up to 51%.

  • 85. Long, Y.
    et al.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Yan, X.
    Analysis and evaluation of per-flow delay bound for multiplexing models2014In: Proceedings -Design, Automation and Test in Europe, DATE, Institute of Electrical and Electronics Engineers Inc. , 2014Conference paper (Refereed)
    Abstract [en]

    Multiplexing models are common in resource sharing communication media such as buses, crossbars and networks. While sending packets over a multiplexing node, the packet delay bound can be computed using network calculus models. The tightness of such delay bound remains an open problem. This paper studies the multiplexing models for weighted round robin scheduling with different traffic arrival curves, and analyzes per-flow packet delay bounds with different service properties. We empirically evaluate the tightness of the delay bounds. Our results show the quality of different analysis models, and how influential each parameter is to tightness.

  • 86. Long, Yanchen
    et al.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT).
    Shen, Haibin
    Composable Worst-Case Delay Bound Analysis Using Network Calculus2018In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, ISSN 0278-0070, E-ISSN 1937-4151, Vol. 37, no 3, p. 705-709Article in journal (Refereed)
    Abstract [en]

    Performance analysis is playing an indispensable role in design and evaluation for on-chip networks. In former studies, the end-to-end delay bound is calculated by the equivalent service curve method based on network calculus when resource sharing happens. However, in this paper, we propose a composable method to get the bound. This method uses the aggregated local arrival curve to get the local delay bound first, then calculates the end-to-end bound by summing up local bounds. This method solves the scalability problem and largely decreases the computation complexity compared with the former method.
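
    The idea of summing local bounds can be sketched with a textbook network-calculus result. The sketch assumes a token-bucket arrival curve (burst sigma, rate rho) for the aggregate at each node and a rate-latency service curve (rate R, latency T), for which the local delay bound is T + sigma/R when rho does not exceed R. These curve shapes and all names below are illustrative assumptions, not the exact model of the paper.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    sigma: float    # burstiness of the aggregated local arrival curve (assumed token bucket)
    rho: float      # long-term arrival rate
    rate: float     # service rate R of the node (assumed rate-latency server)
    latency: float  # service latency T of the node


def local_delay_bound(hop):
    """Delay bound T + sigma/R for a token-bucket flow at a rate-latency server."""
    if hop.rho > hop.rate:
        raise ValueError("arrival rate exceeds service rate; the delay is unbounded")
    return hop.latency + hop.sigma / hop.rate


def end_to_end_bound(hops):
    """Composable bound: the sum of the local bounds along the path."""
    return sum(local_delay_bound(h) for h in hops)


path = [Hop(sigma=4, rho=0.5, rate=1, latency=2),
        Hop(sigma=6, rho=0.5, rate=2, latency=1)]
print(end_to_end_bound(path))
```

    Each node is analyzed on its own, so adding a hop only adds one term to the sum.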

  • 87.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Automotive Ethernet: Towards TSN and Beyond2016In: COMPUTER SAFETY, RELIABILITY, AND SECURITY, SAFECOMP 2016, 2016Conference paper (Refereed)
  • 88.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Cross Clock-Domain TDM Virtual Circuits for Networks on Chips2011In: NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip, 2011, p. 209-216Conference paper (Refereed)
    Abstract [en]

    We propose cross clock-domain time-division-multiplexing (TDM) Virtual Circuit (VC), in short, VC, to provide delay and bandwidth guaranteed communication for NoCs with multiple clock domains. The cross-domain VC extends the synchronous VC in a single clock domain to multiple clock domains. The synchronous VCs reserve cyclic time slots at each node from source to destination for a traffic flow to use shared links without contention based on the assumption that all nodes share the same notion of time. However, when VCs pass multiple clock domains with different phases and frequencies, the assumption of global synchrony is not valid any more and consequently they cannot function correctly. This paper addresses this problem based on a typical FIFO clock domain interface. We give the conditions and a realization scheme to ensure correct packet delivery with QoS for VCs crossing multiple clock domains. We apply network calculus to analyze and derive the bounds of the packet delay and the FIFO size.

  • 89.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Design and Analysis of On-Chip Communication for Network-on-Chip Platforms2007Doctoral thesis, comprehensive summary (Other scientific)
    Abstract [en]

    Due to the interplay between increasing chip capacity and complex applications, System-on-Chip (SoC) development is confronted by severe challenges, such as managing deep submicron effects, scaling communication architectures and bridging the productivity gap. Network-on-Chip (NoC) has been a rapidly developed concept in recent years to tackle the crisis with focus on network-based communication. NoC problems spread in the whole SoC spectrum ranging from specification, design, implementation to validation, from design methodology to tool support. In the thesis, we formulate and address problems in three key NoC areas, namely, on-chip network architectures, NoC network performance analysis, and NoC communication refinement.

    Quality and cost are major constraints for micro-electronic products, particularly, in high-volume application domains. We have developed a number of techniques to facilitate the design of systems with low area, high and predictable performance. From flit admission and ejection perspective, we investigate the area optimization for a classical wormhole architecture. The proposals are simple but effective. Not only offering unicast services, on-chip networks should also provide effective support for multicast. We suggest a connection-oriented multicasting protocol which can dynamically establish multicast groups with quality-of-service awareness. Based on the concept of a logical network, we develop theorems to guide the construction of contention-free virtual circuits, and employ a back-tracking algorithm to systematically search for feasible solutions.

    Network performance analysis plays a central role in the design of NoC communication architectures. Within a layered NoC simulation framework, we develop and integrate traffic generation methods in order to simulate network performance and evaluate network architectures. Using these methods, traffic patterns may be adjusted with locality parameters and be configured per pair of tasks. We propose also an algorithm-based analysis method to estimate whether a wormhole-switched network can satisfy the timing constraints of real-time messages. This method is built on traffic assumptions and based on a contention tree model that captures direct and indirect network contentions and concurrent link usage.

    In addition to NoC platform design, application design targeting such a platform is an open issue. Following the trends in SoC design, we use an abstract and formal specification as a starting point in our design flow. Based on the synchronous model of computation, we propose a top-down communication refinement approach. This approach decouples the tight global synchronization into process local synchronization, and utilizes synchronizers to achieve process synchronization consistency during refinement. Meanwhile, protocol refinement can be incorporated to satisfy design constraints such as reliability and throughput.

    The thesis summarizes the major research results on the three topics.

  • 90.
    Lu, Zhonghai
    KTH, School of Information and Communication Technology (ICT), Microelectronics and Information Technology, IMIT.
    Using wormhole switching for networks on chip: feasibility analysis and microarchitecture adaptation2005Licentiate thesis, comprehensive summary (Other scientific)
    Abstract [en]

    Network-on-Chip (NoC) is proposed as a systematic approach to address future System-on-Chip (SoC) design difficulties. Due to its good performance and small buffering requirement, wormhole switching is being considered as a main network flow control mechanism for on-chip networks. Wormhole switching for NoCs poses challenges in both NoC application design and switch complexity reduction.

    In a NoC design flow, mapping an application onto the network should conduct a feasibility analysis in order to determine whether the messages’ timing constraints can be satisfied, and whether the network can be efficiently utilized. This is necessary because network contentions lead to nondeterministic behavior in message delivery. For wormhole-switched networks, we have formulated a contention tree model to accurately capture network contentions and reflect the concurrent use of links. Based on this model, the timing bounds of real-time messages can be derived. Furthermore, we have developed an algorithm to test the feasibility of real-time messages in the networks.

    From the wormhole switch micro-architecture level, switch complexity should be minimized to reduce cost but with reasonable performance penalty. We have investigated the flit admission and flit ejection problems that concern how the flits of packets are admitted into and ejected from the network, respectively. For flit admission, we propose a novel coupling scheme which binds a flit-admission queue with an output physical channel. Our results show that this scheme achieves a reduction of up to 8% in switch area and up to 35% in switch power over other comparable solutions. For flit ejection, we propose a p-sink model which differs from a typical ideal ejection model in that it uses only p flit sinks to eject flits instead of p • v flit sinks as required by the ideal model, where p is the number of physical channels of a switch and v is the number of virtual channels per physical channel. With this model, the buffering cost of flit sinks only depends on p, i.e., is irrespective of v. We have evaluated the coupled flit-admission technique and p-sink model in a 2D 4 x 4 mesh network. In our experiments, they exhibit only limited performance penalties in some cases. We believe that these cost-effective models are promising candidates to be used in wormhole-switched on-chip networks.

  • 91.
    Lu, Zhonghai
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Brachos, Dimitris
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    A Flow Regulator for On-Chip Communication2009In: IEEE INTERNATIONAL SOC CONFERENCE, PROCEEDINGS / [ed] Sezer S; Marshall A; Buechner T, 2009, p. 151-154Conference paper (Refereed)
    Abstract [en]

    We have proposed (sigma, rho)-based flow regulation as a design instrument for System-on-Chip (SoC) architects to control quality-of-service and achieve cost-effective communication, where sigma bounds the traffic burstiness and rho the traffic rate. In this paper, we present a hardware implementation of the regulator. We discuss its microarchitecture. Based on this microarchitecture, we design, implement and synthesize a multi-flow regulator for AXI. Our experiments show the effectiveness of such a regulation device on the control of delay, jitter and buffer requirements.
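
    A (sigma, rho) constraint is commonly enforced with a token bucket, which can be sketched in a few lines. The cycle-based model below is an illustrative assumption (bucket depth sigma, refill of rho tokens per cycle, one token per flit); it is not the hardware microarchitecture described in the paper.

```python
def regulate(arrivals, sigma, rho):
    """Shape a flit stream with a token bucket.

    arrivals: number of flits arriving in each cycle.
    Returns the number of flits released in each cycle.
    """
    tokens = sigma   # bucket starts full, so a burst of sigma flits may pass at once
    backlog = 0      # flits waiting in the regulator buffer
    released = []
    for arriving in arrivals:
        backlog += arriving
        out = min(backlog, int(tokens))  # one whole token is needed per flit
        tokens -= out
        backlog -= out
        released.append(out)
        tokens = min(sigma, tokens + rho)  # refill, capped at the bucket depth
    return released


# A burst of 5 flits is spread out: 2 at once, then one flit every 2 cycles.
print(regulate([5, 0, 0, 0, 0, 0], sigma=2, rho=0.5))
```

    In this model the backlog that builds up inside the regulator is what the buffer requirement has to cover.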

  • 92.
    Lu, Zhonghai
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Admitting and ejecting flits in wormhole-switched networks on chip2007In: Iet Computers and Digital Techniques, ISSN 1751-8601, Vol. 1, no 5, p. 546-556Article in journal (Refereed)
    Abstract [en]

    Reducing the design complexity of switches is essential for cost reduction and power saving in on-chip networks. In wormhole-switched networks, packets are split into flits which are then admitted into and delivered in the network. When reaching destinations, flits are ejected from the network. Since flit admission, flit delivery and flit ejection interfere with each other directly and indirectly, techniques for admitting and ejecting flits exert a significant impact on network performance and switch cost. Different flit-admission and flit-ejection micro-architectures are investigated. In particular, for flit admission, a novel coupling scheme which binds a flit-admission queue with a physical channel (PC) is presented. This scheme simplifies the switch crossbar from 2p x p to (p + 1) x p, where p is the number of PCs per switch. For flit ejection, a p-sink model that uses only p flit sinks to eject flits is proposed. In contrast to an ideal ejection model which requires p · v flit sinks (v is the number of virtual channels per PC), the buffering cost of flit sinks becomes independent of v. The proposed flit-admission and flit-ejection schemes are evaluated with both uniform and locality traffic in a 2D 4 x 4 mesh network. The results show that neither scheme degrades network performance in terms of average packet latency and throughput if the flit injection rate is lower than 0.57 flit/cycle/node.
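
    The savings stated in this abstract can be made concrete with a short calculation. The sketch counts crossbar crosspoints as inputs times outputs, which is a simplifying assumption, and the values p = 5 and v = 4 are example figures chosen here, not parameters reported in the paper.

```python
def crossbar_crosspoints(p, coupled):
    """Crosspoints of a 2p x p crossbar, or (p + 1) x p with coupled flit admission."""
    inputs = p + 1 if coupled else 2 * p
    return inputs * p


def flit_sinks(p, v, p_sink):
    """Flit sinks per switch: p for the p-sink model, p * v for ideal ejection."""
    return p if p_sink else p * v


p, v = 5, 4  # example values: 5 physical channels, 4 virtual channels each
print(crossbar_crosspoints(p, coupled=False), crossbar_crosspoints(p, coupled=True))
print(flit_sinks(p, v, p_sink=False), flit_sinks(p, v, p_sink=True))
```

    The crossbar saving grows with p, while the sink saving grows with v.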

  • 93.
    Lu, Zhonghai
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Flit admission in on-chip wormhole-switched networks with virtual channels2004In: 2004 INTERNATIONAL SYMPOSIUM ON SYSTEM-ON-CHIP, PROCEEDINGS, IEEE conference proceedings, 2004, p. 21-24Conference paper (Refereed)
    Abstract [en]

    Flit-admission solutions for wormhole switches must minimize the complexity of the switches in order to achieve cheap implementations. We propose to couple flit-admission buffers with physical channels so that flits from a flit-admission buffer are dedicated to a physical channel. By the coupling strategy, for input-queuing wormhole lane switches, the complexity of the crossbars can be simplified from 2p x p to (p + 1) x p, where p is the number of physical channels; for output-queuing wormhole lane switches, the additional complexity is also minimal. We evaluate the flit-admission solutions derived from the coupling with uniformly distributed random traffic in a 2D mesh network. Experimental results show that these solutions exhibit good performance in terms of latency and throughput.

  • 94.
    Lu, Zhonghai
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Flit ejection in on-chip wormhole-switched networks with virtual channels2004In: 22ND NORCHIP CONFERENCE, PROCEEDINGS, IEEE conference proceedings, 2004, p. 273-276Conference paper (Refereed)
    Abstract [en]

    An ideal flit-ejection model is typically assumed in the literature for wormhole switches with virtual channels. With such a model, flits are ejected from the network immediately upon reaching their destinations. This achieves optimal performance but is very costly. The required number of sink queues of a switch for absorbing flits is p · v, where p is the number of physical channels (PCs) of the switch and v is the number of lanes per PC. To achieve cheap silicon implementations, flit-ejection solutions must be cost-effective. We present a novel flit-ejection model and a variant of it where the required number of sink queues of a switch is p, i.e., independent of v. We evaluate the flit-ejection models with uniformly distributed random traffic in a 2D mesh network. Experimental results show that they exhibit good performance in latency and throughput.

  • 95.
    Lu, Zhonghai
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Network-on-Chip Assembler Language. 2003. Report (Other academic)
  • 96.
    Lu, Zhonghai
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Refinement for Communication-Based Design. 2003. In: Swedish System-on-Chip Conference (SSoCC’03), 2003. Conference paper (Other academic)
  • 97.
    Lu, Zhonghai
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Slot allocation using logical networks for TDM virtual-circuit configuration for network-on-chip. 2007. In: IEEE/ACM International Conference on Computer-Aided Design, Digest of Technical Papers, ICCAD, 2007, p. 18-25. Conference paper (Refereed)
    Abstract [en]

    Configuring Time-Division-Multiplexing (TDM) Virtual Circuits (VCs) for network-on-chip must guarantee conflict freedom for overlapping VCs in addition to allocating sufficient time slots to them. These requirements are fulfilled in the slot allocation phase. In this paper, we define the concept of a logical network (LN). Based on this concept, we develop and prove theorems that constitute sufficient and necessary conditions to establish conflict-free VCs. Using these theorems, slot allocation for VCs becomes a procedure of computing LNs and then assigning VCs to different LNs. TDM VC configuration can thus be predictable and correct-by-construction. We have integrated this slot allocation method into our multi-node VC configuration program and applied the program to an industrial application.

  • 98.
    Lu, Zhonghai
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    TDM virtual-circuit configuration for network-on-chip. 2008. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 16, no 8, p. 1021-1034. Article in journal (Refereed)
    Abstract [en]

    In network-on-chip (NoC), time-division-multiplexing (TDM) virtual circuits (VCs) have been proposed to satisfy the quality-of-service requirements of applications. TDM VC is a connection-oriented communication service by which two or more connections take turns to share buffers and link bandwidth using dedicated time slots. In this paper, we first give a formulation of the multinode VC configuration problem for arbitrary NoC topologies. A multinode VC allows multiple source and destination nodes on it. Then we address the two problems of path selection and slot allocation for TDM VC configuration. For the path selection, we use a backtracking algorithm to explore the path diversity, constructively searching the solution space. In the slot allocation phase, overlapped VCs must be configured such that no conflict occurs and their bandwidth requirements are satisfied. We define the concept of a logical network (LN) as an infinite set of associated (time slot, buffer) pairs with respect to a buffer on a given VC. Based on this concept, we develop and prove theorems that constitute sufficient and necessary conditions to establish conflict-free VCs. They are applicable to networks where all nodes operate with the same clock frequency but possibly with different phases. Using these theorems, slot allocation for VCs is a procedure of assigning VCs to different LNs. TDM VC configuration can thus be predictable and correct-by-construction. Our experiments on synthetic and real applications validate the effectiveness and efficiency of our approach.
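
The conflict-freedom requirement for overlapping VCs can be illustrated with a brute-force check. The sketch below is not the paper's logical-network theorems. It assumes a deliberately simplified model (open-ended paths, one hop per time slot, a common slot-table period), and all names such as `reserved_slots` and `conflict_free` are hypothetical.

```python
from typing import Dict, Hashable, Sequence, Set, Tuple

# A VC is modelled as (path, start_slots): the ordered links (or buffers) it
# traverses and the slots in which flits are admitted at the first hop.
VC = Tuple[Sequence[Hashable], Set[int]]


def reserved_slots(path: Sequence[Hashable], start_slots: Set[int],
                   period: int) -> Dict[Hashable, Set[int]]:
    """Map each link on the path to the slots it holds, modulo the period.

    Assumption: a flit admitted in slot s reaches hop k in slot (s + k) % period.
    """
    table: Dict[Hashable, Set[int]] = {}
    for hop, link in enumerate(path):
        table.setdefault(link, set()).update((s + hop) % period for s in start_slots)
    return table


def conflict_free(vc_a: VC, vc_b: VC, period: int) -> bool:
    """Return True if the two VCs never claim the same link in the same slot."""
    a = reserved_slots(vc_a[0], vc_a[1], period)
    b = reserved_slots(vc_b[0], vc_b[1], period)
    shared_links = a.keys() & b.keys()
    return all(a[link].isdisjoint(b[link]) for link in shared_links)


if __name__ == "__main__":
    vc1 = (["l1", "l2", "l3"], {0})
    print(conflict_free(vc1, (["l0", "l2"], {0}), period=4))  # both reach l2 in slot 1
    print(conflict_free(vc1, (["l0", "l2"], {1}), period=4))  # shifted by one slot
```

A pairwise check like this only detects conflicts for a given assignment. The abstract's approach instead derives conditions under which assigning VCs to different LNs is conflict-free by construction.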

  • 99.
    Lu, Zhonghai
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Traffic configuration for evaluating networks on chips. 2005. In: Fifth International Workshop on System-on-Chip for Real-Time Applications, Proceedings, IEEE Computer Society, 2005, p. 535-540. Conference paper (Refereed)
    Abstract [en]

    Network-on-Chip (NoC) provides a network as a global communication platform for future SoC designs. Evaluating network architectures requires both synthetic workloads and application-oriented traffic. We present our traffic configuration methods that can be used to configure uniform and locality traffic as synthetic workloads, and to configure channel-based traffic for specific application(s). We also illustrate the significance of applying these methods to configure traffic for network evaluation and system simulation. These traffic configuration methods have been integrated into our Nostrum NoC simulation environment.
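
As a rough illustration of uniform versus locality traffic as synthetic workloads, the sketch below picks a destination node in a 2D mesh. It is not the Nostrum simulator's interface. The function `pick_destination`, its `locality` exponent, and the inverse-distance weighting are assumptions made for this example only.

```python
import random
from typing import Tuple

Node = Tuple[int, int]


def pick_destination(src: Node, width: int, height: int,
                     locality: float = 0.0, rng=random) -> Node:
    """Choose a destination other than src in a width x height mesh.

    Assumption: a node at Manhattan distance d gets weight (1 / d) ** locality,
    so locality = 0 gives uniform traffic and larger values favour near nodes.
    """
    nodes = [(x, y) for x in range(width) for y in range(height) if (x, y) != src]

    def distance(node: Node) -> int:
        return abs(node[0] - src[0]) + abs(node[1] - src[1])

    weights = [(1.0 / distance(n)) ** locality for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(1)
    print([pick_destination((1, 1), 4, 4, locality=0.0, rng=rng) for _ in range(5)])
    print([pick_destination((1, 1), 4, 4, locality=2.0, rng=rng) for _ in range(5)])
```

Passing a seeded `random.Random` instance keeps a generated workload reproducible across simulation runs.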

  • 100.
    Lu, Zhonghai
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Trends of Terascale Computing Chips in the Next Ten Years. 2009. In: 2009 IEEE 8TH INTERNATIONAL CONFERENCE ON ASIC, VOLS 1 AND 2, PROCEEDINGS / [ed] Tang TA; Zeng XY; Chen Y; Yu HH, NEW YORK: IEEE, 2009, p. 62-66. Conference paper (Refereed)
    Abstract [en]

    Moore's law steadily continues though facing a number of challenges. This paper identifies ongoing and desirable trends to exploit the technology capacity and further Moore's law for terascale on-chip computing architectures in the next ten years. Four foreseeable trends are: from single core to many cores, from bus-based to network-based interconnect, from centralized memory to distributed memory, and from 2D integration to 3D integration. We motivate these trends and show that the number of design choices for computing chips is increasing rapidly, leading to an exploding design space with uncountable opportunities for the innovative architect. Moreover, we envision that the multicore Network-on-Chip will become an infrastructure backbone and accumulate many other infrastructural functions such as memory, power and resource management, testing and diagnostic services.
