1 - 36 of 36
  • 1. Anwar, Hassan
    et al.
    Jafri, Syed Mohammad Asad Hassan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Dytckov, Sergei
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Exploring Spiking Neural Network on Coarse-Grain Reconfigurable Architectures. 2014. In: ACM International Conference Proceeding Series, 2014, p. 64-67. Conference paper (Refereed)
    Abstract [en]

    Today, reconfigurable architectures are becoming increasingly popular as candidate platforms for neural networks. Existing works that map neural networks onto reconfigurable architectures address either FPGAs or Networks-on-Chip, without any reference to Coarse-Grain Reconfigurable Architectures (CGRAs). In this paper we investigate the overheads imposed by implementing spiking neural networks on a Coarse-Grained Reconfigurable Architecture (CGRA). Experimental results (using point-to-point connectivity) reveal that up to 1000 neurons can be connected, with an average response time of 4.4 msec.

  • 2.
    Farahini, Nasim
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Jafri, S. M. A. H.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Sohofi, Hassan
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    SiLago: A Structured Layout Scheme to Enable Efficient High Level and System Level Synthesis. 2016. Report (Other academic)
  • 3.
    Farahini, Nasim
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Sohofi, Hassan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jafri, Syed M. A. H.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Tajammul, Muhammad Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Parallel distributed scalable runtime address generation scheme for a coarse grain reconfigurable computation and storage fabric. 2014. In: Microprocessors and Microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 38, no 8, p. 788-802. Article in journal (Refereed)
    Abstract [en]

    This paper presents a hardware based solution for a scalable runtime address generation scheme for DSP applications mapped to a parallel distributed coarse grain reconfigurable computation and storage fabric. The scheme can also deal with non-affine functions of multiple variables that typically correspond to multiple nested loops. The key innovation is the judicious use of two categories of address generation resources. The first category is the low-cost AGU, which generates addresses within given address bounds for affine functions of up to two variables. Such low-cost AGUs are distributed and associated with every read/write port in the distributed memory architecture. The second category is relatively more complex; it is also distributed, but shared among a few storage units, and is capable of handling more complex address generation requirements, such as dynamic computation of the address bounds that are then used to configure the AGUs, and transformation of non-affine functions into affine functions by computing the non-affine factor outside the loop. The runtime computation of the address constraints incurs negligibly small overhead in latency, area and energy, while providing substantial reductions in program storage and energy, and improved reconfiguration agility, compared to the prevalent pre-computation of address constraints. The efficacy of the proposed method has been validated against prevalent address generation schemes for a set of six realistic DSP functions. Compared to the pre-computation method, the proposed solution achieved 75% average code compaction, and compared to a centralized runtime address generation scheme, it achieved 32.7% average performance improvement.
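
The two-tier split the abstract describes can be sketched in a few lines: a cheap AGU streams addresses for affine functions of up to two loop variables, and a shared unit makes a non-affine access affine by computing the non-affine factor outside the inner loop. This is an illustrative sketch only; the function names and the example access pattern (base + i*i*scale + j) are invented here, and the paper's actual hardware interface is not shown.

```python
# Sketch of the two categories of address generation resources described above.

def agu_affine(base, a, b, i_bound, j_bound):
    """Low-cost AGU: streams addresses addr = base + a*i + b*j within bounds."""
    for i in range(i_bound):
        for j in range(j_bound):
            yield base + a * i + b * j

def shared_unit(base, scale, i_bound, j_bound):
    """Shared unit: for each outer iteration, computes the non-affine term
    i*i*scale once, then configures the AGU for the now-affine inner loop."""
    for i in range(i_bound):
        row_base = base + i * i * scale  # non-affine part, hoisted out of the inner loop
        yield from agu_affine(row_base, 0, 1, 1, j_bound)

addrs = list(shared_unit(base=0, scale=16, i_bound=3, j_bound=4))
```

The point of the split is that the inner loop, which dominates the address traffic, only ever needs the cheap affine resource; the expensive computation happens once per outer iteration.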

  • 4.
    Guang, Liang
    et al.
    University of Turku, Finland.
    Jafri, Syed Mohammad Asad Hassan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Yang, Bo
    University of Turku, Finland.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Embedding Fault-Tolerance with Dual-Level Agents in Many-Core Systems. 2012. In: First MEDIAN Workshop (MEDIAN'12), 2012. Conference paper (Other academic)
    Abstract [en]

    Dual-level fault-tolerance on many-core systems is presented, provided by a software-based system agent and hardware-based local agents. The system agent performs fault-triggered, energy-aware remapping under bandwidth constraints, addressing coarse-grained processor failures. The local agents achieve fine-grained link-level fault tolerance against transient and permanent errors. The paper concisely presents the architecture, the dual-level fault-tolerant techniques, and experimental results.

  • 5.
    Guang, Liang
    et al.
    University of Turku, Finland.
    Jafri, Syed Mohammad Asad Hassan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Yang, Bo
    University of Turku, Finland.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hierarchical supporting structure for dynamic organization in many-core computing systems. 2013. In: PECCS 2013: Proceedings of the 3rd International Conference on Pervasive Embedded Computing and Communication Systems, 2013, p. 252-261. Conference paper (Refereed)
    Abstract [en]

    Hierarchical supporting structures for dynamic organization in many-core computing systems are presented. With profound hardware variations and unpredictable errors, dependability becomes a challenging issue in the emerging many-core systems. To provide fault-tolerance against processor failures or performance degradation, dynamic organization is proposed, which allows clusters to be created and updated at run-time. Hierarchical supporting structures are designed for each level of monitoring agents, to enable the tracing, storing and updating of component and system status. These supporting structures need to follow software/hardware co-design to provide small and scalable overhead, while accommodating the functions of agents on the corresponding level. This paper presents the architectural design, functional simulation and implementation analysis. The study demonstrates that the proposed structures facilitate dynamic organization in case of processor failures and incur small area overhead on many-core systems.

  • 6.
    Hemani, Ahmed
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Farahini, Nasim
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Sohofi, Hassan
    KTH, School of Information and Communication Technology (ICT).
    Li, Shuo
    KTH, School of Information and Communication Technology (ICT).
    Paul, K.
    The SiLago solution: Architecture and design methods for a heterogeneous dark silicon aware coarse grain reconfigurable fabric. 2017. In: The Dark Side of Silicon: Energy Efficient Computing in the Dark Silicon Era, Springer, 2017, p. 47-94. Chapter in book (Refereed)
    Abstract [en]

    The dark silicon constraint will restrict VLSI designers to utilizing an increasingly smaller percentage of transistors as we progress deeper into the nano-scale regime, because of power delivery and thermal dissipation limits. The best way to deal with the dark silicon constraint is to use the transistors that can be turned on as efficiently as possible. Inspired by this rationale, the VLSI design community has adopted customization as the principal means to address the dark silicon constraint. Two categories of customization, often used in tandem, have been adopted by the community. The first is processors that are heterogeneous in functionality and/or have the ability to more efficiently match varying functionalities and runtime load. The second category of customization is based on the fact that hardware implementations often offer 2-6 orders of magnitude more efficiency compared to software. For this reason, designers isolate the power and performance critical functionality and map it to custom hardware implementations called accelerators. Both these categories of customization are partial in being compute centric, and still implement the bulk of functionality in the inefficient software style. In this chapter, we propose a contrarian approach: implement the bulk of functionality in hardware style and only retain control intensive and flexibility critical functionality in small simple processors that we call flexilators. We propose using a micro-architecture level coarse grain reconfigurable fabric, as an alternative to Boolean level standard cells and the LUTs of FPGAs, as the basis for dynamically reconfigurable hardware implementation. This coarse grain reconfigurable fabric allows dynamic creation of arbitrarily wide and deep datapaths, with their hierarchical control, that can be coupled with a cluster of storage resources to create private execution partitions that host individual applications.
    Multiple such partitions can be created, operating at different voltage-frequency operating points, and unused resources can be put into a range of low power modes. This CGRA fabric allows not just compute centric customization: interconnect, control, storage and access to storage can also be customized. The customization is possible not only at compile/build time but also at runtime, to match the available resources and runtime load conditions. This complete, micro-architecture level, hardware centric customization overcomes the limitations of the partial, compute centric customization offered by the state-of-the-art accelerator-rich heterogeneous multi-processor implementation style, by extracting more functionality and performance from the limited number of transistors that can be turned on. Besides offering complete and more effective customization and a hardware centric implementation style, we also propose a methodology that dramatically reduces the cost of customization. This methodology is based on a concept called the SiLago (Silicon Large Grain Objects) method. The core idea behind the SiLago method is to use large grain, micro-architecture level, hardened and characterized blocks, the SiLago blocks, as the atomic physical design building blocks, together with a grid based structured layout scheme that enables composition of the SiLago fabric simply by abutting the blocks to produce a timing and DRC clean GDSII design. Effectively, the SiLago method raises the abstraction of physical design to the micro-architectural level from the present Boolean level, standard cell and LUT based physical design. This significantly improves the efficiency and predictability of synthesis from higher levels of abstraction. In addition, it enables true system-level synthesis that, by virtue of a correct-by-construction guarantee, eliminates the costly functional verification step.
    The proposed solution allows a fully customized design with dynamic fine grain power management to be generated automatically from Simulink down to GDSII, with computational and silicon efficiencies that are only modestly lower than ASIC. The micro-architecture level, SiLago block based design process with correct-by-construction guarantee is 5-6 orders of magnitude more efficient and 2 orders of magnitude more accurate compared to Boolean standard cell based design flows.

  • 7.
    Hemani, Ahmed
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Masoumian, S.
    Synchoricity and NOCs could make Billion Gate custom hardware centric SOCs affordable. 2017. In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Association for Computing Machinery (ACM), 2017, article id 8. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a novel synchoros VLSI design scheme that discretizes space uniformly. Synchoros derives from the Greek word chóros, for space. We propose raising the physical design abstraction to register transfer level by using coarse grain reconfigurable building blocks called SiLago blocks. SiLago blocks are hardened and synchoros, and are used to create arbitrarily complex VLSI design instances by abutting them, without requiring any further logic or physical synthesis. SiLago blocks are interconnected by two levels of NOCs, regional and global. By configuring the SiLago blocks and the two levels of NOCs, it is possible to create implementation alternatives whose cost metrics can be evaluated with agility and post-layout accuracy. This framework, called the SiLago framework, includes a synthesis based design flow that allows end to end automation: multi-million gate functionality modeled as SDF in Simulink is transformed into a timing and DRC clean physical design in minutes, while exploring 100s of solutions. We benchmark the synthesis efficiency, and the silicon and computational efficiencies, against conventional standard cell based tooling, showing two orders of magnitude improvement in accuracy and three orders of magnitude improvement in synthesis, while eliminating the need to verify at lower abstractions like RTL. The proposed solution is being extended to deal with system-level, non-compile time functionalities. We also present arguments on how synchoricity could also contribute to eliminating the engineering cost of designing masks, to lower the manufacturing cost.

  • 8.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT).
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Abbas, Naeem
    Serrano Leon, Guillermo
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    TransMap: Transformation Based Remapping and Parallelism for High Utilization and Energy Efficiency in CGRAs. 2016. In: IEEE Transactions on Computers, ISSN 0018-9340, E-ISSN 1557-9956, Vol. 65, no 11, p. 3456-3469. Article in journal (Refereed)
    Abstract [en]

    In the era of platforms hosting multiple applications with arbitrary inter-application communication and computation patterns, compile time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., utilization and energy). To implement dynamic remapping, reconfigurable architectures commonly store multiple (compile-time generated) implementations of an application. Each implementation represents a different platform location and/or degree of parallelism, and the optimal implementation is selected at run-time. However, this compile-time binding either incurs excessive configuration memory overheads and/or is unable to map/parallelize an application even when sufficient resources are available. As a solution to this problem, we present Transformation based reMapping and parallelism (TransMap). TransMap stores only a single implementation and applies a series of transformations to the stored bitstream to remap or parallelize an application. Compared to the state of the art, in addition to simple relocation in the horizontal/vertical directions, TransMap also allows rotating an application for mapping or parallelizing it in resource constrained scenarios. By storing only a single implementation, TransMap offers significant reductions in configuration memory requirements (up to 73 percent for the tested applications) compared to state of the art compaction techniques. Simulation results reveal that the additional flexibility reduces the energy requirements by 33 percent and enhances device utilization by 50 percent for the tested applications. Gate level analysis reveals that TransMap incurs negligible silicon (0.2 percent of the platform) and timing (6 additional cycles per application) penalties.
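
The relocate/rotate idea in the abstract can be illustrated with a toy model in which an application's placement on the CGRA grid is a set of (column, row) cells and remapping is a coordinate transformation of one stored implementation. This is a sketch under invented names; the paper operates on configuration bitstreams, which this toy version does not model.

```python
# Illustrative sketch of TransMap-style remapping by transforming a single
# stored implementation instead of storing many alternatives.

def translate(cells, dx, dy):
    """Relocate a placement horizontally/vertically on the grid."""
    return [(x + dx, y + dy) for (x, y) in cells]

def rotate90(cells):
    """Rotate a placement 90 degrees clockwise, then shift it back into the
    positive quadrant so it is a valid grid placement again."""
    rotated = [(y, -x) for (x, y) in cells]
    min_x = min(x for (x, _) in rotated)
    min_y = min(y for (_, y) in rotated)
    return [(x - min_x, y - min_y) for (x, y) in rotated]

# A 3-wide, 1-tall application that cannot fit a free 1-column-wide region...
app = [(0, 0), (1, 0), (2, 0)]
# ...occupies a single column after rotation, so it can still be mapped.
column = rotate90(app)
```

Storing one implementation plus cheap transformations is what yields the configuration-memory savings the abstract reports, compared to storing one pre-placed implementation per candidate location and orientation.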

  • 9.
    Jafri, Syed
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    SiLago-CoG: Coarse-Grained Grid-Based Design for Near Tape-Out Power Estimation Accuracy at High Level. 2017. In: 2017 IEEE Computer Society Annual Symposium on VLSI, ISVLSI 2017: 3-5 July 2017, Bochum, North Rhine-Westphalia, Germany: proceedings, IEEE Computer Society, 2017, Vol. 2017, p. 25-31, article id 7987490. Conference paper (Refereed)
    Abstract [en]

    It is well known that ASICs have orders of magnitude higher power efficiency than general purpose processors. However, due to the high engineering and manufacturing cost, only a handful of companies can afford to design ASICs. To reduce this cost, numerous high-level synthesis tools have emerged over the last 2-3 decades. In spite of these tools, ASIC design is still considered expensive, because they fail to accurately predict the cost metrics. The inaccuracy is costly, as it results in multiple iterations between RTL, logic synthesis, and physical design. The major reason behind this inaccuracy, at high level, is the unavailability of information like wiring, orientation, and placement of hardware blocks. To tackle this issue, recent works have proposed raising the abstraction of the physical design from standard cells to micro-architectural blocks physically organized in a structured grid based layout scheme. While these works have been successful in accurately predicting area and timing, to the best of our knowledge their effectiveness in accurately estimating power is yet to be determined. SiLago-CoG provides an efficient technique to characterize these blocks and estimate power at high level. Simulation and synthesis results reveal that SiLago-CoG provides up to 15X better power estimates in 680X less time, at the cost of up to 50% additional area, compared to the state-of-the-art.

  • 10.
    Jafri, Syed
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Intesa, Leonardo
    KTH.
    SPEED: Open-Source Framework to Accelerate Speech Recognition on Embedded GPUs. 2017. In: Proceedings - 20th Euromicro Conference on Digital System Design, DSD 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 94-101, article id 8049772. Conference paper (Refereed)
    Abstract [en]

    Due to their high accuracy, inherent redundancy, and embarrassingly parallel nature, neural networks are fast becoming mainstream machine learning algorithms. However, these advantages come at the cost of high memory and processing requirements (which can be met by GPUs, FPGAs or ASICs). For embedded systems, the requirements are particularly challenging because of stiff power and timing budgets. Due to the availability of efficient mapping tools, GPUs are an appealing platform on which to implement neural networks. While there is significant work implementing image recognition (in particular Convolutional Neural Networks) on GPUs, only a few works deal with efficient implementation of speech recognition on GPUs, and the work that does focus on speech recognition does not address embedded systems. To tackle this issue, this paper presents SPEED, an open-source framework to accelerate speech recognition on embedded GPUs. We have used the Eesen speech recognition framework because it is considered the most accurate speech recognition technique. Experimental results reveal that the proposed techniques offer a 2.6X speedup compared to the state of the art.

  • 11.
    Jafri, Syed
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Paul, Kolin
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Abbas, N.
    MOCHA: Morphable Locality and Compression Aware Architecture for Convolutional Neural Networks. 2017. In: Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 276-286, article id 7967117. Conference paper (Refereed)
    Abstract [en]

    Today, machine learning based on neural networks has become mainstream in many application domains. A small subset of machine learning algorithms, called Convolutional Neural Networks (CNNs), are considered state-of-the-art for many applications (e.g. video/audio classification). The main challenge in implementing CNNs in embedded systems is their large computation, memory, and bandwidth requirements. To meet these demands, dedicated hardware accelerators have been proposed. Since memory is the major cost in CNNs, recent accelerators focus on reducing memory accesses. In particular, they exploit data locality using either tiling, layer merging or intra/inter feature map parallelism to reduce the memory footprint. However, they lack the flexibility to interleave or cascade these optimizations. Moreover, most of the existing accelerators do not exploit compression, which can simultaneously reduce memory requirements, increase throughput, and enhance energy efficiency. To tackle these limitations, we present a flexible accelerator called MOCHA. MOCHA has three features that differentiate it from the state-of-the-art: (i) the ability to compress inputs/kernels, (ii) the flexibility to interleave various optimizations, and (iii) the intelligence to automatically interleave and cascade the optimizations, depending on the dimensions of a specific CNN layer and the available resources. Post-layout synthesis results reveal that MOCHA provides up to 63% higher energy efficiency, up to 42% higher throughput, and up to 30% less storage, compared to the next best accelerator, at the cost of 26-35% additional area.
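
To see why input/kernel compression pays off, note that CNN feature maps and kernels are often sparse, so even a simple zero-skipping encoding shrinks storage and memory traffic. The (value, preceding-zero-run) format below is invented for illustration; MOCHA's actual codec is not described at this level in the abstract.

```python
# Toy zero-skipping compression of a sparse CNN feature map (illustrative only).

def compress(values):
    """Encode a list as (value, zeros_before_it) pairs; trailing zeros get (None, n)."""
    pairs, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            pairs.append((v, zeros))
            zeros = 0
    pairs.append((None, zeros))  # record any trailing run of zeros
    return pairs

def decompress(pairs):
    out = []
    for v, zeros in pairs:
        out.extend([0] * zeros)
        if v is not None:
            out.append(v)
    return out

fmap = [0, 0, 3, 0, 5, 0, 0, 0]
packed = compress(fmap)        # [(3, 2), (5, 1), (None, 3)]
restored = decompress(packed)
```

Here 8 stored values become 3 pairs; on real sparse layers the same idea reduces both footprint and the number of memory accesses, which is where the abstract's throughput and energy gains come from.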

  • 12.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Bag, Ozan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Energy-Aware Coarse-Grained Reconfigurable Architectures using Dynamically Reconfigurable Isolation Cells. 2013. In: Proceedings of the Fourteenth International Symposium on Quality Electronic Design (ISQED 2013), 2013, p. 104-111. Conference paper (Refereed)
    Abstract [en]

    This paper presents a self-adaptive architecture to enhance the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). Today, platforms host multiple applications, with arbitrary inter-application communication and concurrency patterns. Each application itself can have multiple versions (implementations with different degrees of parallelism), and the optimal version can only be determined at runtime. For such scenarios, traditional worst case designs and compile time mapping decisions are neither optimal nor desirable. Existing solutions to this problem employ costly dedicated hardware to configure the operating point at runtime (using DVFS). As an alternative to dedicated hardware, we propose exploiting the reconfiguration features of modern CGRAs. Our solution relies on dynamically reconfigurable isolation cells (DRICs) and an autonomous parallelism, voltage, and frequency selection algorithm (APVFS). The DRICs reduce the overheads of DVFS circuitry by configuring the existing resources as isolation cells. APVFS ensures high efficiency by dynamically selecting the parallelism, voltage and frequency trio which consumes minimum power while meeting the deadlines on the available resources. Simulation results using representative applications (matrix multiplication, FIR, and FFT) showed up to 23% and 51% reductions in power and energy, respectively, compared to traditional DVFS designs. Synthesis results have confirmed a significant reduction in area overheads compared to state of the art DVFS methods.
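
The selection step of an APVFS-style algorithm can be sketched as a search over candidate (parallelism, voltage, frequency) trios for the one with minimum power that still meets the deadline on the free resources. Everything here is illustrative: the function name, the ideal-speedup timing model, and the dynamic-power model (P ~ C * V^2 * f per active core) are assumptions, not the paper's exact formulation.

```python
# Sketch of the minimum-power operating-point selection described above.

def apvfs_select(workload_cycles, deadline_s, free_cores, points, cap_farads=1e-9):
    """Return the minimum-power (parallelism, voltage, frequency) trio, or None."""
    best_power, best_point = None, None
    for parallelism, volts, freq_hz in points:
        if parallelism > free_cores:
            continue  # not enough free resources for this degree of parallelism
        exec_time = workload_cycles / (freq_hz * parallelism)  # ideal linear speedup
        if exec_time > deadline_s:
            continue  # this operating point misses the deadline
        power = parallelism * cap_farads * volts ** 2 * freq_hz
        if best_power is None or power < best_power:
            best_power, best_point = power, (parallelism, volts, freq_hz)
    return best_point

# Three candidate versions of one application: serial and fast, or parallel and slow.
points = [(1, 1.1, 400e6), (2, 0.9, 200e6), (4, 0.8, 100e6)]
choice = apvfs_select(workload_cycles=2e8, deadline_s=1.0, free_cores=4, points=points)
```

With a loose deadline the wide, low-voltage version wins because power scales with V^2; with a tight deadline or few free cores the selection falls back to faster, higher-voltage versions, which is exactly the runtime trade-off the abstract describes.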

  • 13.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Abbas, N.
    Awan, M. A.
    Plosila, J.
    TEA: Timing and Energy Aware compression architecture for Efficient Configuration in CGRAs. 2015. In: Microprocessors and Microsystems, ISSN 0141-9331, E-ISSN 1872-9436. Article in journal (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer time-multiplexing and dynamic application parallelism to enhance device utilization and reduce energy consumption, at the cost of additional memory (up to 50% of the area of the overall platform). To reduce the memory overheads, novel CGRAs employ either statistical compression, intermediate compact representation, or multicasting. Each compaction technique has different properties (i.e. compression ratio, decompression time and decompression energy) and is best suited to a particular class of applications. However, existing research only deals with these methods separately. Moreover, it only analyzes the compression ratio and does not evaluate the associated energy overheads. To tackle these issues, we propose a polymorphic compression architecture that interleaves these techniques in a single platform. The proposed architecture allows each application to take advantage of a separate compression/decompression hierarchy (consisting of various types and implementations of hardware/software decoders) tailored to its needs. Simulation results, using different applications (FFT, matrix multiplication, and WLAN), reveal that the choice of compression hierarchy has a significant impact on compression ratio (up to 52%), decompression energy (up to 4 orders of magnitude), and configuration time (from 33 ns to 1.5 s) for the tested applications. Synthesis results reveal that introducing adaptivity incurs negligible additional overheads (1%) compared to the overall platform area.

  • 14.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Tajammul, Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology.
    Ellervee, Peeter
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Morphable Compression Architecture for Efficient Configuration in CGRAs. 2014. In: 2014 17th Euromicro Conference on Digital System Design (DSD), 2014, p. 42-49. Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications. Novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance power and silicon efficiency, they significantly increase the configuration memory overheads (up to 50% of the area of the overall platform). As a solution to this problem, researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties (i.e. compression ratio and decoding time), and is therefore best suited to a particular class of applications (and situations). However, existing research only deals with these methods separately. In this paper we propose a morphable compression architecture that interleaves these techniques in a single platform. The proposed architecture allows each application to enjoy a separate compression/decompression hierarchy (consisting of various types and implementations of hardware/software decoders) tailored to its needs. Thereby, our solution offers minimal memory use while meeting the required configuration deadlines. Simulation results, using different applications (FFT, matrix multiplication, and WLAN), reveal that the choice of compression hierarchy has a significant impact on compression ratio (from configware replication to 52%) and configuration time (from 33 ns to 1.5 s) for the tested applications. Synthesis results reveal that introducing adaptivity incurs negligible additional overheads (1%) compared to the overall platform area.

  • 15.
    Jafri, Syed M.A.H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Tajammul, Adeel
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Paul, Kolin
    Ellervee, Peeter
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Customizable Compression Architecture for Efficient Configuration in CGRAs2011In: Proceedings: 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014, 2011, p. 31-31Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications. Novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance the power and silicon efficiency, they significantly increase the configuration memory overheads. As a solution to this problem, researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties, and is therefore best suited to a particular class of applications. However, existing research only deals with these methods separately. In this paper we propose a morphable compression architecture that interleaves these techniques in a unique platform.

  • 16.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Bag, Ozan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Energy-Aware CGRAs using Dynamically Re-configurable isolation Cells2013Conference paper (Refereed)
    Abstract [en]

    This paper presents a self-adaptive architecture to enhance the energy efficiency of coarse-grained reconfigurable architectures (CGRAs). Today, platforms host multiple applications, with arbitrary inter-application communication and concurrency patterns. Each application itself can have multiple versions (implementations with different degrees of parallelism) and the optimal version can only be determined at runtime. For such scenarios, traditional worst-case designs and compile-time mapping decisions are neither optimal nor desirable. Existing solutions to this problem employ costly dedicated hardware to configure the operating point at runtime (using DVFS). As an alternative to dedicated hardware, we propose exploiting the reconfiguration features of modern CGRAs. Our solution relies on dynamically reconfigurable isolation cells (DRICs) and an autonomous parallelism, voltage, and frequency selection algorithm (APVFS). The DRICs reduce the overheads of DVFS circuitry by configuring the existing resources as isolation cells. APVFS ensures high efficiency by dynamically selecting the parallelism, voltage, and frequency trio which consumes minimum power to meet the deadlines on available resources. Simulation results using representative applications (matrix multiplication, FIR, and FFT) showed up to 23% and 51% reduction in power and energy, respectively, compared to traditional DVFS designs. Synthesis results have confirmed a significant reduction in area overheads compared to state-of-the-art DVFS methods.

  • 17.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. Turku Centre for Computer Science, Finland; University of Turku, Finland.
    Gia, T.N.
    University of Turku, Finland.
    Dytckov, Sergei
    University of Turku, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    Turku Centre for Computer Science, Finland; University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    NeuroCGRA: A CGRA with support for neural networks2014In: Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, IEEE , 2014, p. 506-511Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) are becoming an increasingly popular implementation platform. In real-world applications, CGRAs are required to simultaneously host processing (e.g. audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. For estimation problems, neural networks promise a higher efficiency than conventional processing. However, most of the existing CGRAs provide no support for neural networks. To realize both neural networks and conventional processing on the same platform, this paper presents NeuroCGRA. NeuroCGRA allows the processing elements and the network to dynamically morph into either a conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.

  • 18.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Guang, Liang
    University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Energy-Aware Fault-Tolerant Network-on-Chips for Addressing Multiple Traffic Classes2012In: Proceedings: 15th Euromicro Conference on Digital System Design, DSD 2012, 2012, p. 242-249Conference paper (Refereed)
    Abstract [en]

    This paper presents an energy-efficient architecture to provide on-demand fault tolerance to multiple traffic classes running simultaneously on a single network-on-chip (NoC) platform. Today, NoCs host multiple traffic classes with potentially different reliability needs. Providing platform-wide worst-case (maximum) protection to all the classes is neither optimal nor desirable. To reduce the overheads incurred by fault tolerance, various adaptive strategies have been proposed. The proposed techniques rely on individual packet fields and operating conditions to adjust the intensity, and hence the overhead, of fault tolerance. The presence of multiple traffic classes undermines the effectiveness of these methods. To complement the existing adaptive strategies, we propose on-demand fault tolerance, capable of providing the required reliability while significantly reducing the energy overhead. Our solution relies on a hierarchical agent-based control layer and a reconfigurable fault-tolerance data path. The control layer identifies the traffic class and directs the packet to the path providing the needed reliability. Simulation results using representative applications (matrix multiplication, FFT, wavefront, and HiperLAN) showed up to 95% decrease in energy consumption compared to traditional worst-case methods. Synthesis results have confirmed a negligible additional overhead for providing on-demand protection (up to 5.3% area), compared to the overall fault-tolerance circuitry.

  • 19.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Guang, Liang
    University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology, Delhi, India.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Energy-aware fault-tolerant network-on-chips for addressing multiple traffic classes2013In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 37, no 8, p. 811-822Article in journal (Refereed)
    Abstract [en]

    This paper presents an energy-efficient architecture to provide on-demand fault tolerance to multiple traffic classes running simultaneously on a single network-on-chip (NoC) platform. Today, NoCs host multiple traffic classes with potentially different reliability needs. Providing platform-wide worst-case (maximum) protection to all the classes is neither optimal nor desirable. To reduce the overheads incurred by fault tolerance, various adaptive strategies have been proposed. The proposed techniques rely on individual packet fields and operating conditions to adjust the intensity, and hence the overhead, of fault tolerance. The presence of multiple traffic classes undermines the effectiveness of these methods. To complement the existing adaptive strategies, we propose on-demand fault tolerance, capable of providing the required reliability while significantly reducing the energy overhead. Our solution relies on a hierarchical agent-based control layer and a reconfigurable fault-tolerance data path. The control layer identifies the traffic class and directs the packet to the path providing the needed reliability. Simulation results using representative applications (matrix multiplication, FFT, wavefront, and HiperLAN) showed up to 95% decrease in energy consumption compared to traditional worst-case methods. Synthesis results have confirmed a negligible additional overhead for providing on-demand protection (up to 5.3% area), compared to the overall fault-tolerance circuitry.

  • 20.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Guang, Liang
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Self-Adaptive NoC Power Management with Dual-Level Agents: Architecture and Implementation2012In: PECCS 2012 - Proceedings of the 2nd International Conference on Pervasive Embedded Computing and Communication Systems, 2012, p. 450-458Conference paper (Refereed)
    Abstract [en]

    The architecture and implementation of an adaptive NoC to improve performance and power consumption is presented. On platforms hosting multiple applications, hardware variations and unpredictable workloads make static design-time assignments highly sub-optimal, e.g. in terms of power and performance. As a solution to this problem, adaptive NoCs are designed, which dynamically adapt towards the optimal implementation. This paper addresses the architectural design of an adaptive NoC, which is an essential step towards design automation. The architecture involves two levels of agents: a system-level agent implemented in software on a dedicated general-purpose processor, and local agents implemented as microcontrollers in each network node. The system agent issues specific instructions to perform monitoring and reconfiguration operations, while the local agents operate according to the commands from the system agent. To demonstrate the system architecture, best-effort power management with distributed voltage and frequency scaling is implemented, while meeting run-time execution requirements. Four benchmarks (matrix multiplication, FFT, wavefront, and HiperLAN transmitter) are evaluated on a cycle-accurate RTL-level shared-memory NoC simulator. Power analysis with a 65 nm multi-Vdd library shows a significant reduction in energy consumption (from 21% to 36%). Synthesis also shows minimal area overhead (4%) of the local agent compared to the original NoC switch.

  • 21.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    KTH, School of Information and Communication Technology (ICT), Centres, VinnExcellence Center for Intelligence in Paper and Packaging, iPACK.
    Plosila, Juha
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Compact Generic Intermediate representation (CGIR) to enable late binding in Coarse Grained Reconfigurable Architectures2011Conference paper (Refereed)
  • 22.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Compression Based Efficient and Agile Configuration Mechanism for Coarse Grained Reconfigurable Architectures2011In: Proc. IEEE Int Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW) Symp, 2011, p. 290-293Conference paper (Refereed)
    Abstract [en]

    This paper considers the possibility of speeding up configuration by reducing the size of configware in coarse-grained reconfigurable architectures (CGRAs). Our goal was to reduce the number of cycles and increase the configuration bandwidth. The proposed technique relies on multicasting and bitstream compression. Multicasting reduces the cycles by configuring the components performing identical functions simultaneously, in a single cycle, while bitstream compression increases the configuration bandwidth. We have chosen the dynamically reconfigurable resource array (DRRA) architecture as a vehicle to study the efficiency of this approach. In our proposed method, the configuration bitstream is compressed offline and stored in a memory. If reconfiguration is required, the compressed bitstream is decompressed using an online decompressor and sent to the DRRA. Simulation results using practical applications showed up to 78% and 22% decrease in configuration cycles for completely parallel and completely serial implementations, respectively. Synthesis results have confirmed negligible overhead in terms of area (1.2%) and timing.

  • 23.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH.
    Hemani, Ahmed
    KTH.
    Stathis, Dimitrios
    KTH.
    Can a reconfigurable architecture beat ASIC as a CNN accelerator?2018In: Proceedings - 2017 17th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 97-104Conference paper (Refereed)
    Abstract [en]

    To exploit the high accuracy, inherent redundancy, and embarrassingly parallel nature of Convolutional Neural Networks (CNNs) for intelligent embedded systems, many dedicated CNN accelerators have been presented. These accelerators are optimized to employ compression, tiling, and layer merging for a specific data flow/parallelism pattern. However, the dimensions of a CNN differ widely from one application to another (and also from one layer to another). Therefore, the optimal parallelism and data flow pattern also differ significantly across CNN layers. An efficient accelerator should have the flexibility not only to efficiently support different data flow patterns but also to interleave and cascade them. Achieving this flexibility, however, incurs configuration overheads. This paper analyzes whether the reconfiguration overheads for interleaving and cascading multiple data flow and parallelism patterns are justified. To answer this question, we first design a reconfigurable CNN accelerator, called ReCon. ReCon is then compared with state-of-the-art accelerators. Post-layout synthesis results reveal that ReCon provides up to 2.2X higher throughput and up to 2.3X better energy efficiency at the cost of 26-35% additional area.

  • 24.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Leon, Guillermo Serrano
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Abbas, N.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    TransPar: Transformation based dynamic Parallelism for low power CGRAs2014In: Conference Digest - 24th International Conference on Field Programmable Logic and Applications, FPL 2014, 2014Conference paper (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer runtime parallelism to reduce energy consumption (by lowering voltage/frequency). To implement runtime parallelism, CGRAs commonly store multiple compile-time generated implementations of an application (with different degrees of parallelism) and select the optimal version at runtime. However, the compile-time binding incurs excessive configuration memory overheads and/or is unable to parallelize an application even when sufficient resources are available. As a solution to this problem, we propose Transformation based dynamic Parallelism (TransPar). TransPar stores only a single implementation and applies a series of transformations to generate the bitstream for the parallel version. In addition, it also allows an application to be displaced and/or rotated to enable parallelization in resource-constrained scenarios. By storing only a single implementation, TransPar offers significant reductions in configuration memory requirements (up to 73% for the tested applications), compared to state-of-the-art compaction techniques. Simulation and synthesis results, using real applications, reveal that the additional flexibility allows up to 33% energy reduction compared to static memory-based parallelism techniques. Gate-level analysis reveals that TransPar incurs negligible silicon (0.2% of the platform) and timing (6 additional cycles per application) penalties.

  • 25.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Leon, Guillermo Serrano
    Iqbal, J.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    RuRot: Run-time rotatable-expandable partitions for efficient mapping in CGRAs2014In: Proceedings - International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2014, 2014, p. 233-241Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications, with arbitrary communication and computation patterns. Compile-time mapping decisions are neither optimal nor desirable to efficiently support the diverse and unpredictable application requirements. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remappers displace or expand (parallelize/serialize) an application to optimize different parameters (such as platform utilization). However, the existing remappers support application displacement or expansion in either the horizontal or the vertical direction only. Moreover, most of these works only address dynamic remapping in packet-switched networks and are therefore not applicable to CGRAs that exploit circuit switching for low power and high predictability. To enhance the optimality of run-time remappers, this paper presents a design framework called Run-time Rotatable-expandable Partitions (RuRot). RuRot provides architectural support to dynamically remap or expand (i.e. parallelize) the hosted applications in CGRAs with circuit-switched interconnects. Compared to the state of the art, the proposed design supports application rotation (in clockwise and anticlockwise directions) and displacement (in horizontal and vertical directions) at run-time. Simulation results using a few applications reveal that the additional flexibility enhances device utilization significantly (on average 50% for the tested applications). Synthesis results confirm that the proposed remapper has negligible silicon (0.2% of the platform) and timing (2 cycles per application) overheads.

  • 26.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Piestrak, S. J.
    Sentieys, O.
    Pillement, S.
    Design of a fault-tolerant coarse-grained reconfigurable architecture: A case study2010In: Proceedings of the 11th International Symposium on Quality Electronic Design, ISQED 2010, 2010, p. 845-852Conference paper (Refereed)
    Abstract [en]

    This paper considers the possibility of implementing low-cost hardware techniques that would allow temporary faults in the datapaths of coarse-grained reconfigurable architectures (CGRAs) to be tolerated. Our goal was to use less hardware overhead than the commonly used duplication or triplication methods. The proposed technique relies on concurrent error detection using a residue code modulo 3 and re-execution of the last operation once an error is detected. We have chosen the DART architecture as a vehicle to study the efficiency of this approach to protect its datapaths. Simulation results have confirmed the hardware savings of the proposed approach over duplication.

  • 27.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Piestrak, S. J.
    Sentieys, O.
    Pillement, S.
    Design of the coarse-grained reconfigurable architecture DART with on-line error detection2014In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 38, no 2, p. 124-136Article in journal (Refereed)
    Abstract [en]

    This paper presents the implementation of the coarse-grained reconfigurable architecture (CGRA) DART with on-line error detection, intended to increase fault tolerance. Most parts of the data paths and of the local memory of DART are protected using a residue code modulo 3, whereas only the logic unit is protected using duplication with comparison. These low-cost hardware techniques allow temporary faults (including so-called soft errors caused by radiation) to be tolerated, provided that some technique based on re-execution of the last operation is used. Synthesis results obtained for a 90 nm CMOS technology have confirmed significant hardware and power consumption savings of the proposed approach over the commonly used duplication with comparison. Introducing one extra pipeline stage in the self-checking version of the basic arithmetic blocks has allowed the delay overhead to be significantly reduced compared to our previous design.

  • 28.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Piestrak, Stanislaw J.
    IJL/Université de Lorraine, France.
    Paul, Kolin
    Indian Institute of Technology, Delhi, India.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Energy-Aware Fault-Tolerant CGRAs Addressing Application with Different Reliability Needs2013In: Digital System Design (DSD), 2013 Euromicro Conference on, IEEE conference proceedings, 2013, p. 525-534Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose a polymorphic fault-tolerant architecture that can be tailored to efficiently support the reliability needs of multiple applications at run-time. Today, coarse-grained reconfigurable architectures (CGRAs) host multiple applications with potentially different reliability needs. Providing platform-wide worst-case (maximum) protection to all the applications is neither optimal nor desirable. To reduce the fault-tolerance overhead, adaptive fault-tolerance strategies have been proposed. These techniques assess the reliability requirements of each application and adjust the fault-tolerance intensity (and hence overhead) accordingly. However, existing flexible reliability schemes only allow shifting between different levels of modular redundancy (duplication, triplication, etc.) and deal with only a single class of faults (e.g. soft errors). To complement these strategies, we propose energy-aware fault tolerance that, in addition to modular redundancy, can also provide low-cost, sub-modular (e.g. residue mod 3) redundancy, to cater for both permanent and temporary faults. Our solution relies on an agent-based control layer and a configurable fault-tolerance data path. The control layer identifies the application class and configures the data path to provide the needed reliability. Simulation results using a few selected algorithms (FFT, matrix multiplication, and FIR filter) showed that the proposed method provides flexible protection with an energy overhead ranging from 3.125% to 107% for different reliability levels. Synthesis results have confirmed that the proposed architecture significantly reduces the area overhead for self-checking (59.1%) and fault-tolerant (7.1%) versions, compared to state-of-the-art adaptive reliability techniques.

  • 29.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Tajammul, Muhammad Adeel
    KTH.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Paul, Kolin
    Plosila, Juha
    Ellervee, Peeter
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Polymorphic Configuration Architecture for CGRAs2016In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 24, no 1, p. 403-407Article in journal (Refereed)
    Abstract [en]

    In the era of platforms hosting multiple applications with arbitrary reconfiguration requirements, static configuration architectures are neither optimal nor desirable. Static configuration architectures either incur excessive overheads or cannot support advanced features (such as time-sharing and runtime parallelism). As a solution to this problem, we present a polymorphic configuration architecture (PCA) that provides each application with a configuration infrastructure tailored to its needs.

  • 30.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Tajammul, Muhammad Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. Indian Institute of Technology.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Energy-Aware-Task-Parallelism for Efficient Dynamic Voltage, and Frequency Scaling, in CGRAs2013In: Proceedings - 2013 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, IC-SAMOS 2013, IEEE , 2013, p. 104-112Conference paper (Refereed)
    Abstract [en]

    Today, coarse grained reconfigurable architectures (CGRAs) host multiple applications, with arbitrary communication and computation patterns. Each application itself is composed of multiple tasks, spatially mapped to different parts of the platform. Providing the worst-case operating point to all applications leads to excessive energy and power consumption. To address this problem, dynamic voltage and frequency scaling (DVFS) is a frequently used technique. DVFS allows scaling the voltage and/or frequency of the device based on runtime constraints. Recent research suggests that the efficiency of DVFS can be significantly enhanced by combining dynamic parallelism with DVFS. The proposed methods exploit the speedup induced by parallelism to allow aggressive frequency and voltage scaling. These techniques employ a greedy algorithm that blindly parallelizes a task whenever the required resources are available. Therefore, it is likely to parallelize a task even if it offers no speedup to the application, thereby undermining the effectiveness of parallelism. As a solution to this problem, we present energy-aware task parallelism. Our solution relies on resource allocation graphs and an autonomous parallelism, voltage, and frequency selection algorithm. Using a resource allocation graph as a guide, the autonomous parallelism, voltage, and frequency selection algorithm parallelizes a task only if its parallel version reduces the overall application execution time. Simulation results, using representative applications (MPEG4, WLAN), show that our solution promises better resource utilization compared to the greedy algorithm. Synthesis results (using WLAN) confirm a significant reduction in energy (up to 36%), power (up to 28%), and configuration memory requirements (up to 36%), compared to the state of the art.

  • 31.
    Jafri, Syed
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Piestrak, S. J.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, K.
    Plosila, J.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Implementation and evaluation of configuration scrubbing on CGRAs: A case study2013In: 2013 International Symposium on System-on-Chip, SoC 2013 - Proceedings, IEEE Computer Society, 2013, p. 6675262-Conference paper (Refereed)
    Abstract [en]

    This paper investigates the overhead imposed by various configuration scrubbing techniques used in fault-tolerant Coarse Grained Reconfigurable Arrays (CGRAs). Today's reconfigurable architectures host large configuration memories. As we progress further into the nanometer regime, these configuration memories have become increasingly susceptible to single event upsets caused, e.g., by cosmic radiation. Configuration scrubbing is a frequently used technique to protect configuration memories against single event upsets. Existing works on configuration scrubbing deal only with FPGAs, without any reference to CGRAs (in which configuration memories consume up to 50% of silicon area). Moreover, the known literature lacks a comprehensive comparison of configuration scrubbing techniques that would guide system designers on the merits and demerits of the different scrubbing methods applicable to CGRAs. To address these problems, in this paper we classify various configuration scrubbing techniques and quantify their trade-offs when implemented on a CGRA. Synthesis results reveal that the scrubbing logic incurs negligible silicon overhead (up to 3% of the area of the computational units). Simulation results obtained for a few algorithms/applications (FFT, FIR, matrix multiplication, and WLAN) show that the choice of the configuration scrubbing scheme (external vs. internal) has a significant impact on both the size of the configuration memory and the number of reconfiguration cycles (respectively 20-80% more and up to 38 times more for the former).
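    The core mechanism the abstract compares can be illustrated with a minimal blind-scrubbing pass (one assumed variant of scrubbing, shown here only for intuition): the scrubber periodically rewrites every configuration word from a protected golden copy, clearing any single event upset since the last pass. The memory model and word widths are invented for the example.

    ```python
    def scrub(config_mem, golden_copy):
        """One scrub pass: restore every configuration word from the golden copy.
        Returns how many words had been upset since the previous pass."""
        corrected = 0
        for addr, good_word in enumerate(golden_copy):
            if config_mem[addr] != good_word:   # an upset flipped bits here
                corrected += 1
            config_mem[addr] = good_word        # blind rewrite, upset or not
        return corrected

    golden = [0b1010, 0b0110, 0b1111, 0b0001]
    mem = list(golden)
    mem[2] ^= 0b0100                 # inject a single event upset
    print(scrub(mem, golden))        # prints 1 (one word corrected)
    print(mem == golden)             # prints True
    ```

    The trade-off the paper quantifies follows directly: keeping a golden copy (external scrubbing) grows the configuration memory, while readback-based alternatives trade that space for extra reconfiguration cycles.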

  • 32. Ngyen, T.
    et al.
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Turku Centre for Computer Science, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland .
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Dytckov, Sergei
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland .
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    FIST: A framework to interleave spiking neural networks on CGRAs2015In: Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015, IEEE , 2015, p. 751-758Conference paper (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern embedded applications. In many application domains (e.g. robotics and cognitive embedded systems), CGRAs are required to simultaneously host processing (e.g. audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. Recent works have revealed that the efficiency and scalability of estimation algorithms can be significantly improved by using neural networks. However, existing CGRAs commonly employ homogeneous processing resources for both types of task. To realize the best of both worlds (conventional processing and neural networks), we present FIST. FIST allows the processing elements and the network to dynamically morph into either a conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.

  • 33.
    Tajammul, Muhammad Adeel
    et al.
    KTH.
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Ellervee, P.
    TransMem: A memory architecture to support dynamic remapping and parallelism in low power high performance CGRAs2017In: Proceedings - 2016 26th International Workshop on Power and Timing Modeling, Optimization and Simulation, PATMOS 2016, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 92-99, article id 7833431Conference paper (Refereed)
    Abstract [en]

    In the nanoscale era, upcoming design challenges such as dark silicon, the power wall, and the memory wall have prompted extensive research into architectural alternatives to the general purpose processor. Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as one of the promising alternatives. Commonly, CGRAs are composed of a computation layer and a memory layer. Tempted by higher platform utilization and energy efficiency, recently proposed CGRAs offer dynamic remapping and parallelism. However, existing works address only the computational elements, while for many applications the bulk of the energy is consumed by the memory and memory accesses. Therefore, without architectural support to optimize the memory contents according to changes in the computational layer, the benefits promised by dynamic parallelism and remapping are severely degraded. As a solution to this problem, we present TransMem, a supporting memory infrastructure that complements dynamic remapping and parallelism in the computational fabric. Simulation results reveal that the additional flexibility enhances energy efficiency by up to 85% for the tested applications, compared to the state of the art. Post-layout analysis reveals that TransMem incurs only a 4% area penalty.

  • 34.
    Tajammul, Muhammad Adeel
    et al.
    KTH, School of Information and Communication Technology (ICT). Tallinn University of Technology, Estonia.
    Jafri, Syed M. A.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Ellerve, P.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics. University of Turku, Finland.
    Plosila, J
    DyMeP: An Infrastructure to Support Dynamic Memory Binding for Runtime Mapping in CGRAs2015In: Proceedings of the IEEE International Conference on VLSI Design, IEEE conference proceedings, 2015, no February, p. 547-552, article id 7031792Conference paper (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications. Commonly, CGRAs are composed of a computation layer (that performs computations) and a memory layer (that provides data and configware to the computation layer). Tempted by higher platform utilization and reliability, recently proposed CGRAs offer dynamic application remapping (for the computation layer). Distributed scratchpad (compiler programmed) memories offer high data rates, predictability, and low power consumption (compared to caches). Therefore, distributed scratchpad memories are emerging as the preferred implementation alternative for the memory layer in recent CGRAs. However, scratchpad memories are programmed at compile time and do not support dynamic application remapping. The existing solutions that allow dynamic application remapping either rely on fat binaries (which significantly increase configuration memory requirements) or assume a centralized memory. To extract the benefits of both runtime remapping and distributed scratchpad memories, we present a design framework called DyMeP. DyMeP relies on late binding and provides the architectural support to dynamically remap data in CGRAs. Compared to the state of the art, the proposed technique reduces the configuration memory requirements (needed by fat binary solutions) and supports distributed shared scratchpad memory. Synthesis/simulation results reveal that DyMeP promises a significant (up to 60%) reduction in configware size at the cost of negligible additional overheads (less than 1%).
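    The late-binding idea the abstract contrasts with fat binaries can be sketched as follows. This is a hedged illustration, not DyMeP's actual mechanism: instead of compiling one binary per possible placement (with physical addresses baked in), the configware refers to symbolic buffer identifiers, and a small runtime table binds each identifier to a physical scratchpad address when the application is (re)mapped. All names and addresses are invented.

    ```python
    class LateBinder:
        """Runtime table mapping symbolic buffer ids to physical scratchpad addresses."""
        def __init__(self):
            self.table = {}

        def bind(self, buffer_id, phys_addr):
            self.table[buffer_id] = phys_addr  # done at runtime, at (re)mapping time

        def resolve(self, buffer_id):
            return self.table[buffer_id]       # configware looks up, never hard-codes

    binder = LateBinder()
    binder.bind("fft_in", 0x0400)              # initial mapping
    print(hex(binder.resolve("fft_in")))       # prints 0x400

    binder.bind("fft_in", 0x0800)              # application remapped: rebind only
    print(hex(binder.resolve("fft_in")))       # prints 0x800
    ```

    The point of the sketch is the cost asymmetry: remapping updates one table entry instead of requiring a second full binary, which is why late binding shrinks configware compared to fat-binary solutions.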

  • 35.
    Tajammul, Muhammad Adeel
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jafri, Syed Mohammad Asad Hassan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Private configuration environments (PCE) for efficient reconfiguration, in CGRAs2013In: Proceedings Of The 2013 IEEE 24th International Conference On Application-Specific Systems, Architectures And Processors (ASAP 13), IEEE Computer Society, 2013, p. 227-236Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose a polymorphic configuration architecture that can be tailored at runtime to efficiently support the reconfiguration needs of the applications. Today, CGRAs host multiple applications running simultaneously on a single platform. Novel CGRAs allow each application to exploit late binding and time sharing to enhance power and area efficiency. These features require frequent reconfigurations, making reconfiguration time a bottleneck for time critical applications. Existing solutions to this problem either employ powerful configuration architectures or hide configuration latency (using configuration caching). However, both methods incur significant costs when designed for worst-case reconfiguration needs. As an alternative to a worst-case dedicated configuration mechanism, we exploit reconfiguration to provide each application its own private configuration environment (PCE). PCE relies on a morphable configuration infrastructure, a distributed memory sub-system, and a set of PCE controllers. The PCE controllers customize the morphable configuration infrastructure and reserve a portion of the distributed memory sub-system to act as a context memory for each application separately. Thereby, each application enjoys its own configuration environment, optimal in terms of configuration speed, memory requirements, and energy. Simulation results using representative applications (WLAN and matrix multiplication) show that PCE offers up to 58% reduction in memory requirements compared to a dedicated, worst-case configuration architecture. Synthesis results show that the morphable reconfiguration architecture incurs negligible overheads (3% area and 4% power compared to a single processing element).

  • 36.
    Yang, Y.
    et al.
    KTH.
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Stathis, Dimitrios
    KTH, School of Information and Communication Technology (ICT), Electronics.
    MTP-caffe: Memory, timing, and power aware tool for mapping CNNs to GPUs2017In: ACM International Conference Proceeding Series, Association for Computing Machinery (ACM), 2017, p. 31-36Conference paper (Refereed)
    Abstract [en]

    In the recent past, Convolutional Neural Networks (CNNs) have attracted intense research. The high processing requirements of CNNs and the availability of efficient mapping tools have made GPUs a popular CNN accelerator. To extract maximum performance, the mapping tools transform the unsupported convolutions into GPU-supported matrix multiplications. However, this transformation incurs significant memory overheads (3-5X). Furthermore, since the tool is unaware of the GPU architecture, even after the transformation the performance and power are sub-optimal. To tackle this problem, we present MTP-Caffe, which complements Caffe by making it memory, timing, and power aware. It analyses the CNN structure and the GPU architecture to partition a CNN into smaller parts tailored to the GPU resources. Simulation results reveal that MTP-Caffe not only eliminates the additional memory overheads but also provides up to 21% speedup and up to 23.5% lower power.
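    The memory overhead the abstract cites comes from the standard im2col transformation, which turns convolution into matrix multiplication by duplicating overlapping input patches. A quick count for an assumed layer shape (the layer dimensions below are illustrative, not from the paper) shows why:

    ```python
    def im2col_overhead(h, w, c, k, stride=1, pad=0):
        """Ratio of elements after im2col unrolling to elements in the input tensor."""
        out_h = (h + 2 * pad - k) // stride + 1
        out_w = (w + 2 * pad - k) // stride + 1
        original = h * w * c                     # input elements as stored
        unrolled = out_h * out_w * k * k * c     # every output position copies a k*k*c patch
        return unrolled / original

    # 3x3 convolution, stride 1, same padding: each input element is copied into
    # up to 9 patch columns, so the ratio approaches k*k = 9.
    print(im2col_overhead(56, 56, 64, 3, stride=1, pad=1))   # prints 9.0

    # Strided layers duplicate less; across a whole network the average lands
    # lower, which is consistent with the 3-5X range quoted above.
    print(im2col_overhead(56, 56, 64, 3, stride=2, pad=1))   # prints 2.25
    ```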
