Publications (10 of 185)
Shami, M. A., Tajammul, M. A. & Hemani, A. (2019). Configurable FFT Processor Using Dynamically Reconfigurable Resource Arrays. Journal of Signal Processing Systems, 91(5), 459-473
2019 (English). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 91, no. 5, p. 459-473. Article in journal (Refereed). Published
Abstract [en]

This paper presents results of using a Coarse Grain Reconfigurable Architecture called DRRA (Dynamically Reconfigurable Resource Array) for FFT implementations varying in order and degree of parallelism, using radix-2 decimation in time (DIT). The DRRA fabric is extended with a memory architecture to handle data sets much larger than what can be accommodated in the register files of DRRA. The proposed implementation scheme is generic in terms of the number of FFT points, the size of memory, and the size of the register file in DRRA. Two implementations (DRRA-1 and DRRA-2) have been synthesized in 65 nm technology, and energy/delay numbers were measured with post-layout annotated gate-level simulations. The results are compared to other Coarse Grain Reconfigurable Architectures (CGRAs) and dedicated FFT processors for 1024- and 2048-point FFT. For 1024-point FFT, in terms of FFT operations per unit energy, DRRA-1 and DRRA-2 outperform all CGRAs by at least 2x and are worse than ASIC by 3.45x. However, in terms of energy-delay product, DRRA-2 outperforms CGRAs by at least 1.66x and dedicated FFT processors by at least 10.9x. For 2048-point FFT, DRRA-1 and DRRA-2 are 10x better in energy efficiency and 94.84x better in energy-delay product. However, the radix-2 implementation is worse by 9.64x and 255x in terms of energy efficiency and energy-delay product, respectively, when compared against a radix-2^4 implementation.
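For reference, the radix-2 decimation-in-time algorithm named in the abstract can be sketched as a textbook iterative FFT in plain Python. This is a generic reference implementation, not the paper's DRRA mapping; the function name and structure are illustrative.

```python
import cmath

def fft_radix2_dit(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    x = list(x)
    # Bit-reversal permutation: DIT consumes inputs in bit-reversed order.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # log2(n) butterfly stages, each combining pairs `size/2` apart.
    size = 2
    while size <= n:
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1.0
            for k in range(size // 2):
                a = x[start + k]
                b = x[start + k + size // 2] * w
                x[start + k] = a + b          # butterfly sum
                x[start + k + size // 2] = a - b  # butterfly difference
                w *= w_step
        size *= 2
    return x
```

Each butterfly stage is the unit that a reconfigurable fabric can parallelize; the degree of parallelism mentioned in the abstract corresponds to how many such butterflies execute concurrently.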

Place, publisher, year, edition, pages
SPRINGER, 2019
Keywords
FFT, DRRA, CGRA, Distributed processing, 2048-point FFT, 1024-point FFT, ASIC, Dedicated processors, Synthesis, Address generation
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-252621 (URN), 10.1007/s11265-017-1326-7 (DOI), 000467551000005 (), 2-s2.0-85065474455 (Scopus ID)
Note

QC 20190603

Available from: 2019-06-03. Created: 2019-06-03. Last updated: 2019-06-03. Bibliographically approved
Jafri, S. M., Hemani, A. & Stathis, D. (2018). Can a reconfigurable architecture beat ASIC as a CNN accelerator? In: Proceedings - 2017 17th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017. Paper presented at 17th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017, Samos, Greece, 16 July 2017 through 20 July 2017 (pp. 97-104). Institute of Electrical and Electronics Engineers (IEEE)
2018 (English). In: Proceedings - 2017 17th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 97-104. Conference paper, Published paper (Refereed)
Abstract [en]

To exploit the high accuracy, inherent redundancy, and embarrassingly parallel nature of Convolutional Neural Networks (CNNs) for intelligent embedded systems, many dedicated CNN accelerators have been presented. These accelerators are optimized to employ compression, tiling, and layer merging for a specific data-flow/parallelism pattern. However, the dimensions of a CNN differ widely from one application to another (and also from one layer to another). Therefore, the optimal parallelism and data-flow pattern also differ significantly between CNN layers. An efficient accelerator should have the flexibility not only to efficiently support different data-flow patterns but also to interleave and cascade them. Achieving this flexibility incurs configuration overheads. This paper analyzes whether the reconfiguration overheads for interleaving and cascading multiple data-flow and parallelism patterns are justified. To answer this question, we first design a reconfigurable CNN accelerator, called ReCon. ReCon is then compared with state-of-the-art accelerators. Post-layout synthesis results reveal that ReCon provides up to 2.2x higher throughput and up to 2.3x better energy efficiency at the cost of 26-35% additional area.
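To make the notion of a data-flow pattern concrete, the sketch below shows an output-stationary tiled 2-D convolution in plain Python: each output tile's partial sums stay local until fully accumulated (in hardware, in a processing element's register file). The function name, single-channel simplification, and tiling scheme are illustrative assumptions, not ReCon's actual design.

```python
def conv2d_tiled(inp, kernel, tile=4):
    """Output-stationary tiled 2-D convolution with valid padding (single channel)."""
    H, W = len(inp), len(inp[0])
    K = len(kernel)
    OH, OW = H - K + 1, W - K + 1
    out = [[0.0] * OW for _ in range(OH)]
    # Outer loops walk output tiles; this loop order IS the data-flow pattern.
    # A weight- or input-stationary accelerator would reorder these loops.
    for ty in range(0, OH, tile):
        for tx in range(0, OW, tile):
            for oy in range(ty, min(ty + tile, OH)):
                for ox in range(tx, min(tx + tile, OW)):
                    acc = 0.0  # partial sum stays "on chip" until complete
                    for ky in range(K):
                        for kx in range(K):
                            acc += inp[oy + ky][ox + kx] * kernel[ky][kx]
                    out[oy][ox] = acc
    return out
```

Because the best loop order depends on the layer's dimensions, a fixed-function accelerator bakes one order in, while a reconfigurable one pays configuration overhead to switch orders per layer.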

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-233718 (URN), 10.1109/SAMOS.2017.8344616 (DOI), 2-s2.0-85050553456 (Scopus ID), 9781538634370 (ISBN)
Conference
17th International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS 2017, Samos, Greece, 16 July 2017 through 20 July 2017
Note

QC 20180831

Available from: 2018-08-31. Created: 2018-08-31. Last updated: 2018-08-31. Bibliographically approved
Yang, Y., Stathis, D., Sharma, P., Paul, K., Hemani, A., Grabherr, M. & Ahmad, R. (2018). RiBoSOM: Rapid bacterial genome identification using self-organizing map implemented on the synchoros SiLago platform. In: ACM International Conference Proceeding Series. Paper presented at 18th Annual International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation, SAMOS 2018, 15 July 2018 through 19 July 2018 (pp. 105-114). Association for Computing Machinery (ACM)
2018 (English). In: ACM International Conference Proceeding Series, Association for Computing Machinery (ACM), 2018, p. 105-114. Conference paper, Published paper (Refereed)
Abstract [en]

Artificial Neural Networks (ANNs) have been applied to many traditional machine learning applications in image and speech processing. More recently, ANNs have caught the attention of the bioinformatics community for their ability not only to speed up analysis by avoiding genome assembly but also to work with imperfect data sets containing duplications. ANNs for bioinformatics also have the added attraction of scaling better for massive parallelism than traditional bioinformatics algorithms. In this paper, we have adapted Self-Organizing Maps for rapid identification of bacterial genomes, in a design called BioSOM. BioSOM has been implemented on two coarse grain reconfigurable fabrics customized for dense linear algebra and streaming scratchpad memory, respectively. These fabrics are implemented in a novel synchoros VLSI design style that enables composition by abutment. The synchoricity enables rapid and accurate synthesis from Matlab models to create near-ASIC-efficient solutions. This platform, called SiLago (Silicon Lego), is benchmarked against a GPU implementation. SiLago implementations of BioSOM in four dimensions (128, 256, 512, and 1024 neurons) were trained for two E. coli bacterial strains with 40K training vectors. The results were benchmarked against a GPU GTX 1070 implementation in the CUDA framework. The comparison reveals a 4x to 140x speedup and four to five orders of magnitude improvement in energy-delay product compared to the GPU implementation. This extreme efficiency comes with the added benefit of automated generation of GDSII-level designs from Matlab by using the synchoros VLSI design style.
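For context, the classic Kohonen training loop that a SOM accelerator parallelizes can be sketched as follows. This is a generic one-dimensional SOM in plain Python; the function name, hyperparameters, and decay schedules are illustrative assumptions, not the BioSOM/SiLago implementation.

```python
import math
import random

def train_som(data, n_neurons=8, dim=4, epochs=20, seed=0):
    """Train a 1-D Self-Organizing Map with the classic Kohonen update rule."""
    rng = random.Random(seed)
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_neurons)]
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)                      # decaying learning rate
        radius = max(1.0, n_neurons / 2 * (1 - epoch / epochs))  # shrinking neighborhood
        for v in data:
            # Best-matching unit (BMU): the neuron nearest to the input vector.
            # This distance search is the dense linear-algebra kernel an
            # accelerator computes for all neurons in parallel.
            bmu = min(range(n_neurons),
                      key=lambda i: sum((weights[i][d] - v[d]) ** 2
                                        for d in range(dim)))
            # Pull the BMU and its neighbours towards the input, weighted by
            # a Gaussian of grid distance to the BMU.
            for i in range(n_neurons):
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    weights[i][d] += lr * h * (v[d] - weights[i][d])
    return weights
```

The per-neuron distance computations and weight updates are independent, which is what makes the algorithm map well onto parallel fabrics and GPUs alike.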

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
Series
ACM International Conference Proceeding Series
Keywords
Neural networks, Self-Organizing Maps, SiLago, Synchoros VLSI Design, Parallel architecture, 3D DRAM, GPU
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-247206 (URN), 10.1145/3229631.3229650 (DOI), 000475843000013 (), 2-s2.0-85060986330 (Scopus ID), 9781450364942 (ISBN)
Conference
18th Annual International Conference on Embedded Computer Systems: Architectures, MOdeling and Simulation, SAMOS 2018, 15 July 2018 through 19 July 2018
Note

QC 20190416

Available from: 2019-04-16. Created: 2019-04-16. Last updated: 2019-08-14. Bibliographically approved
Liu, P., Hemani, A., Paul, K., Weis, C., Jung, M. & Wehn, N. (2017). 3D-Stacked Many-Core Architecture for Biological Sequence Analysis Problems. International Journal of Parallel Programming, 45(6), 1420-1460
2017 (English). In: International Journal of Parallel Programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 45, no. 6, p. 1420-1460. Article in journal (Refereed). Published
Abstract [en]

Sequence analysis plays an extremely important role in bioinformatics, and most of its applications have compute-intensive kernels consuming over 70% of total execution time. By exploiting the compute-intensive execution stages of popular sequence analysis applications, we present and evaluate a VLSI architecture focused on those that target biological sequences directly, including pairwise sequence alignment, multiple sequence alignment, database search, and short read sequence mapping. Based on a coarse grained reconfigurable array, we propose the use of many-core and 3D-stacked technologies to gain further improvement in the memory subsystem, which gives another order of magnitude speedup from high bandwidth and low access latency. We analyze our approach in terms of throughput and efficiency for different application mappings. Initial experimental results are evaluated from a stripped-down implementation in a commodity FPGA, and we then scale the results to estimate the performance of our architecture with 9 layers of stacked wafers in a 45-nm process. We demonstrate estimated speedups of at least 40x over corresponding existing hardware accelerator platforms for the entire range of applications and datasets of interest. In comparison, alternative FPGA-based accelerators deliver improvements only for single applications, while GPGPUs perform poorly when accelerating program kernels with random memory accesses and integer addition/comparison operations.
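The pairwise alignment kernel mentioned above is classically computed with a Smith-Waterman-style dynamic-programming recurrence built from exactly the operations the abstract highlights: random memory accesses and integer addition/comparison. The following plain-Python sketch of a banded variant is a textbook reference, not the paper's hardware mapping; the banding and scoring parameters are illustrative.

```python
def banded_smith_waterman(a, b, band=3, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score of strings a, b, restricted to a diagonal band."""
    n, m = len(a), len(b)
    # score[i][j]: best local alignment score ending at a[:i], b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        # Only cells within `band` of the main diagonal are computed,
        # bounding both work and per-core memory footprint.
        lo = max(1, i - band)
        hi = min(m, i + band)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,
                       score[i - 1][j - 1] + s,  # match / mismatch
                       score[i - 1][j] + gap,    # gap in b (deletion)
                       score[i][j - 1] + gap)    # gap in a (insertion)
            score[i][j] = cell
            best = max(best, cell)
    return best
```

The anti-diagonals of this table can be computed in parallel, which is the structural property that many-core and systolic accelerators exploit.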

Place, publisher, year, edition, pages
SPRINGER/PLENUM PUBLISHERS, 2017
Keywords
Accelerator architectures, Application specific integrated circuits, Bioinformatics, Computational biology, Coprocessors, Reconfigurable architectures, Three-dimensional integrated circuits
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-215775 (URN), 10.1007/s10766-017-0495-0 (DOI), 000411558500010 (), 2-s2.0-85017448393 (Scopus ID)
Note

QC 20171023

Available from: 2017-10-23. Created: 2017-10-23. Last updated: 2018-01-13. Bibliographically approved
Liu, P., Hemani, A., Paul, K., Weis, C., Jung, M. & Wehn, N. (2017). A Customized Many-Core Hardware Acceleration Platform for Short Read Mapping Problems Using Distributed Memory Interface with 3D-Stacked Architecture. Journal of Signal Processing Systems, 87(3), 327-341
2017 (English). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 87, no. 3, p. 327-341. Article in journal (Refereed). Published
Abstract [en]

Rapidly developing Next Generation Sequencing technologies produce huge numbers of short reads consisting of randomly fragmented DNA base-pair strings. Assembling these short reads poses a challenge for mapping reads to a reference genome, in terms of both sensitivity and execution time. In this paper, we propose a customized many-core hardware acceleration platform for short read mapping problems based on the hash-index method. The processing core is highly customized to suit both 2-hit string matching and banded Smith-Waterman sequence alignment operations, while a distributed memory interface with 3D-stacked architecture provides high bandwidth and low access latency for highly customized dataset partitioning and memory access scheduling. Consistent with the original BFAST program, our design provides a remarkable 45,012x speedup over the software approach for single-end short reads and 21,102x for paired-end short reads, while also beating a comparable single-FPGA solution by 1,466x for single-end reads. Optimized seed generation gives much better sensitivity while the performance boost remains impressive.
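The hash-index seeding stage described above can be illustrated with a minimal k-mer index in plain Python: read k-mers are looked up in the index and vote for candidate reference positions, and positions supported by multiple seeds (the "2-hit" idea) survive to the alignment stage. All names, the value of k, and the voting threshold are illustrative, not BFAST's or the paper's parameters.

```python
def build_kmer_index(reference, k=4):
    """Map every k-mer of the reference to the list of positions where it occurs."""
    index = {}
    for pos in range(len(reference) - k + 1):
        index.setdefault(reference[pos:pos + k], []).append(pos)
    return index

def seed_candidates(index, read, k=4, min_hits=2):
    """Return implied alignment starts supported by at least `min_hits` k-mer seeds."""
    votes = {}
    for off in range(len(read) - k + 1):
        for pos in index.get(read[off:off + k], ()):
            start = pos - off                 # implied start of the read on the reference
            votes[start] = votes.get(start, 0) + 1
    # Surviving candidates would then be scored with banded Smith-Waterman.
    return sorted(s for s, c in votes.items() if c >= min_hits)
```

Requiring two independent seed hits per candidate filters most spurious single-k-mer matches before the expensive alignment step.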

Place, publisher, year, edition, pages
Springer, 2017
Keywords
Accelerator architectures, Application specific integrated circuits, Bioinformatics, Computational biology, Coprocessors, Three-dimensional integrated circuits
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-208228 (URN), 10.1007/s11265-016-1204-8 (DOI), 000399451800005 (), 2-s2.0-85001022032 (Scopus ID)
Note

QC 20170627

Available from: 2017-06-27. Created: 2017-06-27. Last updated: 2017-06-27. Bibliographically approved
Jafri, S., Hemani, A. & Intesa, L. (2017). SPEED: Open-Source Framework to Accelerate Speech Recognition on Embedded GPUs. In: Proceedings - 20th Euromicro Conference on Digital System Design, DSD 2017. Paper presented at 20th Euromicro Conference on Digital System Design, DSD 2017, Vienna, Austria, 30 August 2017 through 1 September 2017 (pp. 94-101). Institute of Electrical and Electronics Engineers (IEEE), Article ID 8049772.
2017 (English). In: Proceedings - 20th Euromicro Conference on Digital System Design, DSD 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 94-101, article id 8049772. Conference paper, Published paper (Refereed)
Abstract [en]

Due to their high accuracy, inherent redundancy, and embarrassingly parallel nature, neural networks are fast becoming mainstream machine learning algorithms. However, these advantages come at the cost of high memory and processing requirements (which can be met by GPUs, FPGAs, or ASICs). For embedded systems, the requirements are particularly challenging because of stiff power and timing budgets. Due to the availability of efficient mapping tools, GPUs are an appealing platform on which to implement neural networks. While there is significant work implementing image recognition (in particular Convolutional Neural Networks) on GPUs, only a few works deal with efficient implementation of speech recognition on GPUs, and the work that does focus on speech recognition does not address embedded systems. To tackle this issue, this paper presents SPEED, an open-source framework to accelerate speech recognition on embedded GPUs. We have used the Eesen speech recognition framework because it is considered the most accurate speech recognition technique. Experimental results reveal that the proposed techniques offer a 2.6x speedup compared to the state of the art.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2017
Keywords
GPU, Machine learning, Neural Networks, Speech Recognition
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-223018 (URN), 10.1109/DSD.2017.89 (DOI), 000427097100013 (), 2-s2.0-85034452083 (Scopus ID), 9781538621455 (ISBN)
Conference
20th Euromicro Conference on Digital System Design, DSD 2017, Vienna, Austria, 30 August 2017 through 1 September 2017
Note

QC 20180226

Available from: 2018-02-26. Created: 2018-02-26. Last updated: 2018-04-03. Bibliographically approved
Hemani, A., Jafri, S. & Masoumian, S. (2017). Synchoricity and NOCs could make Billion Gate custom hardware centric SOCs affordable. In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017. Paper presented at 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Seoul, South Korea, 19 October 2017 through 20 October 2017. Association for Computing Machinery (ACM), Article ID 8.
2017 (English). In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Association for Computing Machinery (ACM), 2017, article id 8. Conference paper (Refereed)
Abstract [en]

In this paper, we present a novel synchoros VLSI design scheme that discretizes space uniformly. Synchoros derives from the Greek word chóros, for space. We propose raising the physical design abstraction to register transfer level by using coarse grain reconfigurable building blocks called SiLago blocks. SiLago blocks are hardened and synchoros, and are used to create arbitrarily complex VLSI design instances by abutting them, without requiring any further logic or physical synthesis. SiLago blocks are interconnected by two levels of NOCs, regional and global. By configuring the SiLago blocks and the two levels of NOCs, it is possible to create implementation alternatives whose cost metrics can be evaluated with agility and post-layout accuracy. This framework, called the SiLago framework, includes a synthesis-based design flow that enables end-to-end automation: multi-million-gate functionality modeled as SDF in Simulink is transformed into a timing- and DRC-clean physical design in minutes, while exploring hundreds of solutions. We benchmark the synthesis efficiency, and the silicon and computational efficiencies, against conventional standard-cell-based tooling, showing two orders of magnitude improvement in accuracy and three orders of magnitude improvement in synthesis, while eliminating the need to verify at lower abstractions such as RTL. The proposed solution is being extended to deal with system-level, non-compile-time functionalities. We also present arguments on how synchoricity could contribute to eliminating the engineering cost of designing masks, lowering the manufacturing cost.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
Keywords
ASICs, Coarse Grain Reconfiguration, ESL, High-level Synthesis, NOCs, SOCs, Synchoricity, VLSI Design
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-219649 (URN), 10.1145/3130218.3132339 (DOI), 2-s2.0-85035780779 (Scopus ID), 9781450349840 (ISBN)
Conference
11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Seoul, South Korea, 19 October 2017 through 20 October 2017
Funder
VINNOVA
Note

QC 20171213

Available from: 2017-12-13. Created: 2017-12-13. Last updated: 2017-12-13. Bibliographically approved
Farahini, N., Hemani, A. & Sohofi, H. (2016). AlgoSil: A High Level Synthesis Tool targeting Micro-architecture Level Physical Design Platform. KTH Royal Institute of Technology
2016 (English). Report (Other academic)
Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2016
Series
TRITA-ICT ; 2016:14
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-185782 (URN), 978-91-7595-973-3 (ISBN)
Note

QC 20160429

Available from: 2016-04-27. Created: 2016-04-27. Last updated: 2016-04-29. Bibliographically approved
Jafri, S. M., Tajammul, M. A., Hemani, A., Paul, K., Plosila, J., Ellervee, P. & Tenhunen, H. (2016). Polymorphic Configuration Architecture for CGRAs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(1), 403-407
2016 (English). In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 24, no. 1, p. 403-407. Article in journal (Refereed). Published
Abstract [en]

In the era of platforms hosting multiple applications with arbitrary reconfiguration requirements, static configuration architectures are neither optimal nor desirable: they either incur excessive overheads or cannot support advanced features (such as time-sharing and runtime parallelism). As a solution to this problem, we present a polymorphic configuration architecture (PCA) that provides each application with a configuration infrastructure tailored to its needs.

Place, publisher, year, edition, pages
IEEE, 2016
Keywords
Memory architecture, memory management, multiprocessor interconnection, reconfigurable logic
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-180970 (URN), 10.1109/TVLSI.2015.2402392 (DOI), 000367261900045 (), 2-s2.0-84961376632 (Scopus ID)
Note

QC 20160128

Available from: 2016-01-28. Created: 2016-01-26. Last updated: 2018-01-10. Bibliographically approved
Badawi, M., Lu, Z. & Hemani, A. (2016). Service-Guaranteed Multi-Port Packet Memory for Parallel Protocol Processing Architecture. In: Proceedings - 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016. Paper presented at 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016, Greece (pp. 408-412). Institute of Electrical and Electronics Engineers (IEEE), Article ID 7445367.
2016 (English). In: Proceedings - 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016, Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 408-412, article id 7445367. Conference paper, Published paper (Refereed)
Abstract [en]

Parallel processing architectures have been increasingly utilized due to their potential for improving performance and energy efficiency. Unfortunately, the anticipated improvement often suffers from a limitation caused by memory access latency and latency variation, which consequently impact Quality of Service (QoS). This paper presents a service-guaranteed multi-port packet memory system to boost parallelism in protocol processing architectures. In the proposed memory system, all arriving packets are guaranteed a memory space, such that a packet's memory space can be allocated in a bounded number of cycles and each of its locations is accessible in a single cycle. We consider a real-time Voice over Internet Protocol (VoIP) call as a case study to evaluate our service-guaranteed memory system.
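The bounded-time allocation guarantee can be illustrated with a fixed-slot buffer pool managed by a free list: allocation is a single dequeue, so it always completes in a bounded number of steps, and any byte is addressed directly by (slot, offset). This plain-Python sketch is only an analogy for the idea; the class name, slot count, and slot size are illustrative, not the paper's memory design.

```python
from collections import deque

class PacketMemory:
    """Fixed-slot packet buffer pool with O(1) allocate/free."""

    def __init__(self, n_slots=64, slot_bytes=2048):
        self.slot_bytes = slot_bytes
        self.slots = [bytearray(slot_bytes) for _ in range(n_slots)]
        self.free = deque(range(n_slots))   # free list: O(1) popleft/append

    def alloc(self):
        """Return a free slot id; bounded-time because it is one dequeue."""
        if not self.free:
            raise MemoryError("no free packet slots")
        return self.free.popleft()

    def write(self, slot, offset, data):
        # Direct addressing by (slot, offset): no search, single "cycle" per access.
        self.slots[slot][offset:offset + len(data)] = data

    def read(self, slot, offset, length):
        return bytes(self.slots[slot][offset:offset + length])

    def release(self, slot):
        self.free.append(slot)
```

Because every packet gets a whole fixed-size slot, allocation never fragments, which is what makes the worst-case latency bound possible.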

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2016
Keywords
Multi-port Memory, Packet-oriented Memory, Protocol processing, service-guaranteed
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-184159 (URN), 000381810900061 (), 2-s2.0-84968884496 (Scopus ID), 9781467387750 (ISBN)
Conference
24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016, Greece
Note

QC 20160419

Available from: 2016-03-29. Created: 2016-03-29. Last updated: 2016-10-05. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0565-9376
