1 - 13 of 13
  • 1.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Li, Shuo
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Tajammul, Muhammad Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Guo
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Ye, Wei
    Huawei, Wireless Beijing Division, China.
    39.9 GOPs/watt multi-mode CGRA accelerator for a multi-standard basestation. 2013. In: 2013 IEEE International Symposium on Circuits and Systems (ISCAS), IEEE, 2013, p. 1448-1451. Conference paper (Refereed)
    Abstract [en]

    This paper presents an industrial case study of using a Coarse Grain Reconfigurable Architecture (CGRA) as a multi-mode accelerator for two kernels, executed in a mutually exclusive manner: FFT for the LTE standard and the Correlation Pool for the UMTS standard. The CGRA multi-mode accelerator achieved a computational efficiency of 39.94 GOPS/watt (OP is multiply-add) and a silicon efficiency of 56.20 GOPS/mm². By analyzing the code and inferring the unused features of the fully programmable solution, an in-house developed tool was used to automatically customize the design to run just the two kernels, and the two efficiency metrics improved to 49.05 GOPS/watt and 107.57 GOPS/mm². Corresponding numbers for the ASIC implementation are 63.84 GOPS/watt and 90.91 GOPS/mm². Though the ASIC’s silicon and computational efficiency numbers are slightly better, the engineering efficiency of the pre-verified/characterized CGRA solution is at least 10X better than the ASIC solution.
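To put the quoted efficiency figures side by side, the ratios can be worked out directly. The following is only an illustrative calculation on the numbers stated in the abstract; the dictionary layout and the helper name `ratio` are assumptions made for this sketch.

```python
# Efficiency figures as quoted in the abstract:
# (computational efficiency in GOPS/watt, silicon efficiency in GOPS/mm^2).
designs = {
    "programmable CGRA": (39.94, 56.20),
    "customized CGRA": (49.05, 107.57),
    "ASIC": (63.84, 90.91),
}


def ratio(a, b):
    """Return a / b rounded to two decimals (hypothetical helper)."""
    return round(a / b, 2)


base_power, base_area = designs["programmable CGRA"]
cust_power, cust_area = designs["customized CGRA"]
asic_power, asic_area = designs["ASIC"]

# Gain obtained by customizing the fully programmable CGRA.
power_gain = ratio(cust_power, base_power)
area_gain = ratio(cust_area, base_area)

# Remaining computational-efficiency gap between the ASIC and the customized CGRA.
asic_power_ratio = ratio(asic_power, cust_power)

print(f"customization gain: {power_gain}x GOPS/watt, {area_gain}x GOPS/mm^2")
print(f"ASIC vs customized CGRA: {asic_power_ratio}x GOPS/watt")
```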

  • 2.
    Malik, Omer
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A Library Development Framework for a Coarse Grain Reconfigurable Architecture. 2011. In: VLSI Design (VLSI Design), 2011 24th International Conference on, 2011, p. 153-158. Conference paper (Refereed)
    Abstract [en]

    A framework is proposed for efficiently capturing the rich microarchitectural space of a substantial Matlab-like library of DSP functions for a regular Coarse Grain Reconfigurable Architecture (CGRA) fabric. A subset of C has been proposed to model the DSP functions, and an automatic tool to generate the configware for the CGRA fabric has been developed. A method to estimate the average energy of such functions is reported, with an error margin of less than 3%. Such a framework is proposed as the basis for raising the abstraction to automate synthesis of entire physical layers.

  • 3.
    Malik, Omer
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    High Level Synthesis Framework for a Coarse Grain Reconfigurable Architecture. 2010. In: 28th Norchip Conference, NORCHIP 2010, 2010, p. 5669439. Conference paper (Refereed)
    Abstract [en]

    A High Level Synthesis Framework for mapping DSP algorithms on a Coarse Grain Reconfigurable Architecture is presented. The behavioral specification of the algorithm is written in C with pragmas in comments, and the tool generates configware after performing timing and synchronization synthesis. Pragmas identify SIMD-type concurrency and sweep the architectural space with allocation and binding annotations to produce implementations ranging from fully serial to fully parallel. This allows the user to stay at the algorithmic level and guide the HLS tool to search a restricted architectural space bounded by the pragmas, thus making the synthesis process more efficient and predictable.

  • 4.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Dynamically Reconfigurable Resource Array. 2012. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    The goals set by the International Technology Roadmap for Semiconductors (ITRS) for the consumer portable category, to be realized by 2020, are a 1000X improvement in performance with only a 40% increase in power budget and no increase in design team size. To meet these goals, the challenges facing the VLSI community are gaps in architecture efficacy, design productivity and battery capacity.

    As the causes of the gaps in architecture efficacy and battery capacity, this thesis identifies: a) instruction granularity mismatch, b) bit-width granularity mismatch, c) silicon granularity mismatch and d) parallelism mismatch. Field Programmable Gate Array (FPGA) technology can address instruction/bit-width granularity and parallelism mismatch but suffers from silicon granularity mismatch due to high reconfiguration overheads. The ultimate design goal of a system-on-chip is to achieve ASIC-like performance and FPGA-like flexibility, design time and cost. Coarse Grain Reconfigurable Architectures (CGRAs) are a compromise between ASIC and FPGA since they provide better computational efficiency than FPGAs and better engineering efficiency than ASICs. However, the current generation of CGRAs lacks many architectural properties that would enable mainstream industry to use them in place of ASICs and/or FPGAs.

    To objectively discuss these properties, the first part of the thesis proposes a classification scheme that classifies parallel computing machines into 47 classes and proposes how they can be graded in terms of flexibility. We apply this classification scheme to academic and industrial reconfigurable architectures to compare their similarities and differences. We identify an instruction flow spatial computing class to be used for a CGRA fabric called Dynamically Reconfigurable Resource Array (DRRA), presented in the second part of this thesis.

    The DRRA fabric is a Parallel Distributed Digital Signal Processing (PDDSP) fabric with distributed arithmetic, logic, interconnect and control resources. Problems associated with the distributed control model of DRRA are identified, and architectural solutions that can be exploited by the compiler tools are presented.

    After logical and physical synthesis, DRRA shows a peak performance of 21 GOPS and a peak silicon efficiency of 16.03 GOPS/mm². We further performed a three-level validation of the DRRA fabric. At the first level, we mapped a number of signal and compute intensive algorithms to demonstrate the flexibility of the DRRA fabric. At the second level, we measured the gap between ASIC, DRRA and FPGA. On average, DRRA shows improvements over FPGA of 22.87x in area, 10.75x in power consumption, 852x in configuration bits, 959x in configuration cycles, 63.94x in silicon efficiency, 4.78x in computational efficiency, and 6.15E+10x in energy-delay product. Finally, at the third level, we present the use of DRRA for a real-world example: implementing a 128-, 256-, 512-, 1024-, 2048-point configurable FFT processor. For a 1024-point FFT, in terms of computational efficiency, DRRA outperforms all CGRAs by at least 2x and is worse than ASIC by 3.45x. As regards silicon efficiency, although dedicated processors perform 1.6x better, DRRA is better than all other CGRAs.

  • 5.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Address generation scheme for a coarse grain reconfigurable architecture. 2011. In: Proc. IEEE Int Application-Specific Systems, Architectures and Processors (ASAP) Conf, 2011, p. 17-24. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe a versatile address generation scheme for distributed storage resources of a coarse grain Parallel Distributed Digital Signal Processing (PDDSP) reconfigurable architecture under development in our group. This scheme proposes distributed address generation units (AGUs) to decouple the address generation logic from the compute logic in order to exploit parallelism (ILP and TLP). To achieve this, the proposed distributed address generation scheme supports standard DSP address generation modes such as linear vectorized, circular buffer and bit-reverse addressing, all with parameterizable address range and increment/decrement offsets. It is further enhanced with temporal flexibility by introducing three dynamically programmable delays: an initial delay before the stream starts, a middle delay after every address generation for the stream, and an end delay after the stream is complete. The dynamic programmability of these delays makes streams elastic, and such streams can be chained with an interrupt mechanism to create chained-elastic streams. Our approach is compared with the traditional approaches of using VLIW and Scalar. Our approach shows a 21×(Scalar), 10×(VLIW) reduction in instructions and a 2×(Scalar) reduction in cycles for a single-thread FIR filter. When compared for Synchronous and Asynchronous scenarios of two parallel threads T1 and T2, our approach shows a 4.6×(Scalar), 5.6×(VLIW) reduction in instructions and a 1.76× reduction in cycles for Synchronous threads, and a 4.6×(Scalar), 15×(VLIW) reduction in instructions and a 1.76×(Scalar) reduction in cycles for Asynchronous threads.
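The addressing modes and the three delays named in this abstract can be pictured with a small software model. This is a minimal sketch, not the AGU design from the paper: the function names, the parameter names, and the convention of emitting `None` for an idle cycle are all assumptions made for illustration.

```python
def bit_reverse(value, width):
    """Reverse the lowest `width` bits of `value`."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def address_stream(mode, start, count, step=1, size=None,
                   initial_delay=0, middle_delay=0, end_delay=0):
    """Yield one entry per cycle: an address, or None for an idle cycle.

    Hypothetical model: `size` is the buffer length for the circular and
    bit-reverse modes (a power of two is assumed for bit-reverse).
    """
    for _ in range(initial_delay):      # idle cycles before the stream starts
        yield None
    for i in range(count):
        if mode == "linear":
            addr = start + i * step
        elif mode == "circular":
            addr = start + (i * step) % size
        elif mode == "bit_reverse":
            addr = start + bit_reverse(i, size.bit_length() - 1)
        else:
            raise ValueError(f"unknown mode: {mode}")
        yield addr
        for _ in range(middle_delay):   # idle cycles after every address
            yield None
    for _ in range(end_delay):          # idle cycles after the stream completes
        yield None


# Bit-reverse order for an 8-entry buffer, as used to reorder FFT data.
print(list(address_stream("bit_reverse", 0, 8, size=8)))
# Linear stream with all three delays set.
print(list(address_stream("linear", 10, 3, step=2,
                          initial_delay=1, middle_delay=1, end_delay=2)))
```

Making the delays arguments of the generator mirrors the idea of an elastic stream: the same address pattern can be stretched in time without changing the pattern itself.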

  • 6.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    An improved self-reconfigurable interconnection scheme for a Coarse Grain Reconfigurable Architecture. 2010. In: NORCHIP 2010: 28th Norchip Conference, 2010. Conference paper (Refereed)
    Abstract [en]

    An improved dynamic, partial and self-reconfigurable interconnection network (Hybrid-2 network) is presented for the Dynamically Reprogrammable Resource Array (DRRA), which is a Coarse Grain Reconfigurable Architecture (CGRA). To justify the design decision, the Hybrid-2 network implementation is compared against possible implementations using Multiplexer, NoC, Crossbar and the already published Hybrid-1 interconnection network. Results show that the newly presented Hybrid-2 interconnection network takes (1.08x, 0.104x, 0.212x and 0.681x) the area and (1x, 0.037x, 0.026x and 0.107x) the configuration bits of the Multiplexer, NoC, Crossbar and Hybrid-1 implementations respectively. The Hybrid-2 network is also 2.87x and 5.86x faster than the Multiplexer and Hybrid-1 networks.

  • 7.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Classification of Massively Parallel Computer Architectures. 2012. In: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012, IEEE, 2012, p. 344-351. Conference paper (Refereed)
    Abstract [en]

    Faced with slowing performance and energy benefits of technology scaling, VLSI/Computer architectures have turned from parallel to massively parallel machines for personal and embedded applications in the form of multi and many core architectures. Additionally, in the pursuit of finding the sweet spot between engineering and computational efficiency, massively parallel Coarse Grain Reconfigurable Architectures (CGRAs) have been researched. While these architectures have been surveyed, they have not been rigorously classified to enable objective differentiation and comparison for performance, area and flexibility. In this paper, we extend the well-known Skillicorn taxonomy to create new classes, present a scoring system to rate these classes on flexibility, and present equations for early estimation of area and configuration overheads. Furthermore, we use this extended classification scheme to classify and compare 25 different massively parallel architectures that cover most of the reported CGRAs and other well-known multi and many core architectures.

  • 8.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Control Scheme for a CGRA. 2010. In: Proc. 22nd Int Computer Architecture and High Performance Computing (SBAC-PAD) Symp, 2010, p. 17-24. Conference paper (Refereed)
    Abstract [en]

    The ability to instantiate low cost and agile FSMs that can implement arbitrary parallelism, and to combine such FSMs in a chain and in a hierarchy, is one of the key differentiating factors between ASICs and MPSOCs. CGRAs that have been reported in the literature, like MPSOCs, also lack this ASIC-like ability. The downside of ASICs is their lack of reuse and high engineering cost. We present a CGRA architecture that retains the programmability of a CGRA and yet has the ASIC-like ability to a) construct arbitrarily parallel data-path/FSM combinations, b) chain an arbitrary number of such FSMs and c) create a hierarchy of such chains. We present in detail the architecture of such a control scheme and illustrate its use for an example composed of FFT and FIRs. We quantify the benefits of our approach by benchmarking for energy-delay product against a) ASICs (4.8X worse), b) a state-of-the-art CGRA (4.58X better) and c) FPGAs (63.95X better).

  • 9.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Morphable DPU: Smart and Efficient Data Path for Signal Processing Applications. 2009. In: SIPS: 2009 IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEMS, 2009, p. 167-172. Conference paper (Refereed)
    Abstract [en]

    A coarse grained morphable Datapath Unit (mDPU) has been proposed. This mDPU implements the multiplier in a way that enables the component adders to be reused when the multiplier is not needed. A pipelined design further enhances the design by creating a datapath that is balanced in the temporal sense. These two features result in a design that optimally uses silicon and time. A judicious set of Coarse Granular instructions is enabled by the mDPU, which we show can implement typical signal processing functions. A radix-2 64-point FFT has been implemented in 90 nm technology using the proposed mDPUs, and performance and energy results from the physical design phase are reported and compared to a state-of-the-art comparable design from the research community. A 4X improvement in performance and a 2.5X improvement in power-performance product are reported.

  • 10.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Partially Reconfigurable Interconnection Network for Dynamically Reprogrammable Resource Array. 2009. In: 2009 IEEE 8TH INTERNATIONAL CONFERENCE ON ASIC, VOLS 1 AND 2, PROCEEDINGS / [ed] Tang TA; Zeng XY; Chen Y; Yu HH, NEW YORK: IEEE, 2009, p. 122-125. Conference paper (Refereed)
    Abstract [en]

    This paper describes an innovative regular, non-blocking, point-to-point, point-to-multipoint, low latency interconnection network scheme with sliding window connectivity, which allows arbitrary parallelism among large sub-systems. The area overhead of the interconnect is only 30% of the chip area, which is much smaller than the 80% in the case of FPGAs. The interconnection scheme is partially and dynamically reconfigurable. The configware is reduced 5.6 times by using binary encoding, which allows energy efficient dynamic reconfiguration.
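Why binary encoding shrinks configware can be seen with a generic count of select bits. The sketch below assumes the unencoded baseline is one configuration bit per selectable source (one-hot), which the abstract does not state; the fan-in values and function names are hypothetical, and the abstract's 5.6 figure is not derived here.

```python
def one_hot_bits(sources):
    """Configuration bits when each selectable source has its own enable bit."""
    return sources


def binary_bits(sources):
    """Configuration bits when the selected source is stored as a binary index."""
    return max(1, (sources - 1).bit_length())


def reduction(sources):
    """Ratio of one-hot to binary-encoded configuration bits."""
    return one_hot_bits(sources) / binary_bits(sources)


# Hypothetical fan-in values; the saving grows with the number of sources.
for fan_in in (16, 32):
    print(fan_in, one_hot_bits(fan_in), binary_bits(fan_in), reduction(fan_in))
```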

  • 11.
    Tajammul, Muhammad Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Segmented bus based path setup scheme for a distributed memory architecture. 2012. In: Proceedings - IEEE 6th International Symposium on Embedded Multicore SoCs, MCSoC 2012, IEEE, 2012, p. 67-74. Conference paper (Refereed)
    Abstract [en]

    This paper proposes a composite instruction for path setup and partitioning of a network on chip using segmented buses. The network connects a distributed memory to a coarse grained reconfigurable architecture. The scheme decreases the partitioning and routing instructions in sequencers (S) for the nodes (N) from N×3 to a single instruction. This reduction in instructions also brings a small performance benefit, as fewer instructions are scheduled onto the network. Furthermore, it is possible to optimize the system under application-specific constraints. A simple use case with experiments is defined to show the design trade-offs for these optimization decisions.

  • 12.
    Tajammul, Muhammad Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Moorthi, Sridharan
    NIT, Trichi, India.
    NoC Based Distributed Partitionable Memory System for a Coarse Grain Reconfigurable Architecture. 2011. In: 24th Annual Conference on VLSI Design, IEEE Computer Society, 2011, p. 232-237. Conference paper (Refereed)
  • 13.
    Tajammul, Muhammad Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Moorthi, Sridharan Moorthi
    NIT, Trichy, India.
    A NoC based distributed memory architecture with programmable and partitionable capabilities. 2010. In: 28th Norchip Conference, NORCHIP 2010, 2010. Conference paper (Refereed)
    Abstract [en]

    The paper focuses on the design of a Network-on-chip based programmable and partitionable distributed memory architecture which can be integrated with a Coarse Grain Reconfigurable Architecture (CGRA). The proposed interconnect enables better interaction between the computation fabric and the memory fabric. The system can modify its memory to computation element ratio at runtime. The extensive capabilities of the memory system are analyzed by interfacing it with a Dynamically Reconfigurable Resource Array (DRRA), a CGRA. The interconnect can provide multiple interfaces which support up to 8 GB/s per interface.
