Publications (10 of 181)
Liu, P., Hemani, A., Paul, K., Weis, C., Jung, M. & Wehn, N. (2017). 3D-Stacked Many-Core Architecture for Biological Sequence Analysis Problems. International journal of parallel programming, 45(6), 1420-1460.
2017 (English). In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 45, no 6, p. 1420-1460. Article in journal (Refereed). Published
Abstract [en]

Sequence analysis plays an extremely important role in bioinformatics, and most of its applications have compute-intensive kernels that consume over 70% of total execution time. By exploiting the compute-intensive execution stages of popular sequence analysis applications, we present and evaluate a VLSI architecture with a focus on those that target biological sequences directly, including pairwise sequence alignment, multiple sequence alignment, database search, and short read sequence mapping. Based on a coarse-grained reconfigurable array, we propose the use of many-core and 3D-stacked technologies to further improve the memory subsystem, which gives another order of magnitude speedup from high bandwidth and low access latency. We analyze our approach in terms of its throughput and efficiency for different application mappings. Initial experimental results are obtained from a stripped-down implementation in a commodity FPGA, and we then scale the results to estimate the performance of our architecture with 9 layers of stacked wafers in a 45-nm process. We demonstrate numerous estimated speedups of at least 40 times over the corresponding existing hardware accelerator platforms for the entire range of applications and datasets of interest. In comparison, the alternative FPGA-based accelerators deliver improvement for only a single application, while GPGPUs do not perform well enough when accelerating program kernels with random memory access and integer addition/comparison operations.

Place, publisher, year, edition, pages
SPRINGER/PLENUM PUBLISHERS, 2017
Keyword
Accelerator architectures, Application specific integrated circuits, Bioinformatics, Computational biology, Coprocessors, Reconfigurable architectures, Three-dimensional integrated circuits
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-215775 (URN); 10.1007/s10766-017-0495-0 (DOI); 000411558500010 (); 2-s2.0-85017448393 (Scopus ID)
Note

QC 20171023

Available from: 2017-10-23. Created: 2017-10-23. Last updated: 2018-01-13. Bibliographically approved
Liu, P., Hemani, A., Paul, K., Weis, C., Jung, M. & Wehn, N. (2017). A Customized Many-Core Hardware Acceleration Platform for Short Read Mapping Problems Using Distributed Memory Interface with 3D-Stacked Architecture. Journal of Signal Processing Systems, 87(3), 327-341.
2017 (English). In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 87, no 3, p. 327-341. Article in journal (Refereed). Published
Abstract [en]

Rapidly developing Next Generation Sequencing technologies produce huge amounts of short reads consisting of randomly fragmented DNA base pair strings. Assembling those short reads poses a challenge for mapping reads to a reference genome in terms of both sensitivity and execution time. In this paper, we propose a customized many-core hardware acceleration platform for short read mapping problems based on the hash-index method. The processing core is highly customized to suit both 2-hit string matching and banded Smith-Waterman sequence alignment operations, while a distributed memory interface with 3D-stacked architecture provides high bandwidth and low access latency for highly customized dataset partitioning and memory access scheduling. Conforming to the original BFAST program, our design provides a 45,012 times speedup over the software approach for single-end short reads and 21,102 times for paired-end short reads, and also beats a similar single-FPGA solution by 1466 times for single-end reads. Optimized seed generation gives much better sensitivity while the performance boost remains impressive.

Place, publisher, year, edition, pages
Springer, 2017
Keyword
Accelerator architectures, Application specific integrated circuits, Bioinformatics, Computational biology, Coprocessors, Three-dimensional integrated circuits
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-208228 (URN); 10.1007/s11265-016-1204-8 (DOI); 000399451800005 (); 2-s2.0-85001022032 (Scopus ID)
Note

QC 20170627

Available from: 2017-06-27. Created: 2017-06-27. Last updated: 2017-06-27. Bibliographically approved
Hemani, A., Jafri, S. & Masoumian, S. (2017). Synchoricity and NOCs could make Billion Gate custom hardware centric SOCs affordable. In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017. Paper presented at 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Seoul, South Korea, 19 October 2017 through 20 October 2017. Association for Computing Machinery (ACM), Article ID 8.
2017 (English). In: 2017 11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Association for Computing Machinery (ACM), 2017, article id 8. Conference paper (Refereed)
Abstract [en]

In this paper, we present a novel synchoros VLSI design scheme that discretizes space uniformly. Synchoros derives from the Greek word chóros for space. We propose raising the physical design abstraction to register transfer level by using coarse grain reconfigurable building blocks called SiLago blocks. SiLago blocks are hardened and synchoros, and are used to create arbitrarily complex VLSI design instances by abutting them, without requiring any further logic or physical synthesis. SiLago blocks are interconnected by two levels of NOCs, regional and global. By configuring the SiLago blocks and the two levels of NOCs, it is possible to create implementation alternatives whose cost metrics can be evaluated with agility and post-layout accuracy. This framework, called the SiLago framework, includes a synthesis-based design flow that allows end-to-end automation: multi-million gate functionality modeled as SDF in Simulink is transformed into a timing- and DRC-clean physical design in minutes, while exploring hundreds of solutions. We benchmark the synthesis efficiency, and the silicon and computational efficiencies, against conventional standard cell based tooling to show two orders of magnitude improvement in accuracy and three orders of magnitude improvement in synthesis, while eliminating the need to verify at lower abstractions like RTL. The proposed solution is being extended to deal with system-level non-compile time functionalities. We also present arguments on how synchoricity could contribute to eliminating the engineering cost of designing masks, lowering the manufacturing cost.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2017
Keyword
ASICs, Coarse Grain Reconfiguration, ESL, High-level Synthesis, NOCs, SOCs, Synchoricity, VLSI Design
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-219649 (URN); 10.1145/3130218.3132339 (DOI); 2-s2.0-85035780779 (Scopus ID); 9781450349840 (ISBN)
Conference
11th IEEE/ACM International Symposium on Networks-on-Chip, NOCS 2017, Seoul, South Korea, 19 October 2017 through 20 October 2017
Funder
VINNOVA
Note

QC 20171213

Available from: 2017-12-13. Created: 2017-12-13. Last updated: 2017-12-13. Bibliographically approved
Farahini, N., Hemani, A. & Sohofi, H. (2016). AlgoSil: A High Level Synthesis Tool targeting Micro-architecture Level Physical Design Platform. KTH Royal Institute of Technology.
2016 (English). Report (Other academic)
Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2016
Series
TRITA-ICT, 2016:14
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-185782 (URN); 978-91-7595-973-3 (ISBN)
Note

QC 20160429

Available from: 2016-04-27. Created: 2016-04-27. Last updated: 2016-04-29. Bibliographically approved
Jafri, S. M., Tajammul, M. A., Hemani, A., Paul, K., Plosila, J., Ellervee, P. & Tenhunen, H. (2016). Polymorphic Configuration Architecture for CGRAs. IEEE Transactions on Very Large Scale Integration (vlsi) Systems, 24(1), 403-407.
2016 (English). In: IEEE Transactions on Very Large Scale Integration (vlsi) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 24, no 1, p. 403-407. Article in journal (Refereed). Published
Abstract [en]

In the era of platforms hosting multiple applications with arbitrary reconfiguration requirements, static configuration architectures are neither optimal nor desirable. The static reconfiguration architectures either incur excessive overheads or cannot support advanced features (like time-sharing and runtime parallelism). As a solution to this problem, we present a polymorphic configuration architecture (PCA) that provides each application with a configuration infrastructure tailored to its needs.

Place, publisher, year, edition, pages
IEEE, 2016
Keyword
Memory architecture, memory management, multiprocessor interconnection, reconfigurable logic
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-180970 (URN); 10.1109/TVLSI.2015.2402392 (DOI); 000367261900045 (); 2-s2.0-84961376632 (Scopus ID)
Note

QC 20160128

Available from: 2016-01-28. Created: 2016-01-26. Last updated: 2018-01-10. Bibliographically approved
Badawi, M., Lu, Z. & Hemani, A. (2016). Service-Guaranteed Multi-Port Packet Memory for Parallel Protocol Processing Architecture. In: Proceedings - 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016. Paper presented at 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing PDP 2016, Greece (pp. 408-412). Institute of Electrical and Electronics Engineers (IEEE), Article ID 7445367.
2016 (English). In: Proceedings - 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2016, Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 408-412, article id 7445367. Conference paper, Published paper (Refereed)
Abstract [en]

Parallel processing architectures have been increasingly utilized due to their potential for improving performance and energy efficiency. Unfortunately, the anticipated improvement is often limited by memory access latency and latency variation, which in turn impact Quality of Service (QoS). This paper presents a service-guaranteed multi-port packet memory system to boost parallelism in protocol processing architectures. In the proposed memory system, all arriving packets are guaranteed a memory space, such that a packet memory space can be allocated in a bounded number of cycles and each of its locations is accessible in a single cycle. We consider a real-time Voice Over Internet Protocol (VOIP) call as a case study to evaluate our service-guaranteed memory system.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2016
Keyword
Multi-port Memory, Packet-oriented Memory, Protocol processing, service-guaranteed
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-184159 (URN); 000381810900061 (); 2-s2.0-84968884496 (Scopus ID); 9781467387750 (ISBN)
Conference
24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing PDP 2016, Greece
Note

QC 20160419

Available from: 2016-03-29. Created: 2016-03-29. Last updated: 2016-10-05. Bibliographically approved
Farahini, N., Hemani, A., Jafri, S. M. & Sohofi, H. (2016). SiLago: A Structured Layout Scheme to Enable Efficient High Level and System Level Synthesis.
2016 (English). Report (Other academic)
Series
TRITA-ICT, 2016:13
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-185781 (URN); 978-91-7595-974-0 (ISBN)
Note

QC 20160429

Available from: 2016-04-27. Created: 2016-04-27. Last updated: 2016-04-29. Bibliographically approved
Hemani, A. (Ed.). (2016). The SiLago method: Next generation VLSI architectures and design methods. Paper presented at 4th ACM International Workshop on Many-Core Embedded Systems, MES 2016, 19 June 2016. Association for Computing Machinery (ACM), 18-22-June-2016.
2016 (English). Conference proceedings (editor) (Refereed)
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2016
Series
ACM International Conference Proceeding Series
Identifiers
urn:nbn:se:kth:diva-197219 (URN); 10.1145/2934495.2936779 (DOI); 2-s2.0-84991088808 (Scopus ID)
Conference
4th ACM International Workshop on Many-Core Embedded Systems, MES 2016, 19 June 2016
Note

QC 20161207

Available from: 2016-12-07. Created: 2016-11-30. Last updated: 2016-12-07. Bibliographically approved
Jafri, S. . A., Daneshtalab, M., Abbas, N., Serrano Leon, G. & Hemani, A. (2016). TransMap: Transformation Based Remapping and Parallelism for High Utilization and Energy Efficiency in CGRAs. I.E.E.E. transactions on computers (Print), 65(11), 3456-3469.
2016 (English). In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 65, no 11, p. 3456-3469. Article in journal (Refereed). Published
Abstract [en]

In the era of platforms hosting multiple applications with arbitrary inter-application communication and computation patterns, compile-time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., utilization and energy). To implement dynamic remapping, reconfigurable architectures commonly store multiple (compile-time generated) implementations of an application. Each implementation represents a different platform location and/or degree of parallelism. The optimal implementation is selected at run-time. However, compile-time binding either incurs excessive configuration memory overheads or is unable to map/parallelize an application even when sufficient resources are available. As a solution to this problem, we present Transformation based reMapping and parallelism (TransMap). TransMap stores only a single implementation and applies a series of transformations to the stored bitstream to remap or parallelize an application. Compared to the state of the art, in addition to simple relocation in horizontal/vertical directions, TransMap also allows an application to be rotated for mapping or parallelizing in resource-constrained scenarios. By storing only a single implementation, TransMap offers significant reductions in configuration memory requirements (up to 73 percent for the tested applications) compared to state-of-the-art compaction techniques. Simulation results reveal that the additional flexibility reduces the energy requirements by 33 percent and enhances the device utilization by 50 percent for the tested applications. Gate-level analysis reveals that TransMap incurs negligible silicon (0.2 percent of the platform) and timing (6 additional cycles per application) penalties.

Place, publisher, year, edition, pages
IEEE, 2016
Keyword
Reconfigurable architectures, run-time remapping, energy aware systems
National Category
Computer Engineering
Identifiers
urn:nbn:se:kth:diva-198578 (URN); 10.1109/TC.2016.2525981 (DOI); 000388498000018 ()
Note

QC 20161219

Available from: 2016-12-19. Created: 2016-12-19. Last updated: 2018-01-13. Bibliographically approved
Liu, P., Hemani, A. & Paul, K. (2015). 3D-stacked many-core architecture for biological sequence analysis problems. In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on. Paper presented at Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on (pp. 211-220). IEEE conference proceedings.
2015 (English). In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, IEEE conference proceedings, 2015, p. 211-220. Conference paper, Published paper (Refereed)
Abstract [en]

Sequence analysis plays a critical role in bioinformatics, and most of its applications have compute-intensive kernels that consume over 70% of total execution time. By exploiting the compute-intensive execution stages of popular sequence analysis applications, we present and evaluate a VLSI architecture with a focus on those that target biological sequences directly, including pairwise alignment, multiple sequence alignment, database search, and short read sequence mapping. Based on a coarse-grained reconfigurable array (CGRA), we propose the use of many-core and 3D-stacked technologies to further improve the memory subsystem, which gives another order of magnitude speedup from high bandwidth and low access latency. We analyze our approach in terms of its throughput and efficiency for different application mappings. Initial experimental results are obtained from a stripped-down implementation in a commodity FPGA, and we then scale the results to estimate the performance of our architecture with 9 layers of 68 mm² stacked wafers in a 45-nm process. We demonstrate numerous estimated speedups of at least 39 times over any existing hardware accelerator for the entire range of applications and datasets of interest. In comparison, the alternative FPGA-based accelerators deliver improvement for only a single application, while GPGPUs do not perform well enough when accelerating program kernels with random memory access and integer addition/comparison operations.

Place, publisher, year, edition, pages
IEEE conference proceedings, 2015
Keyword
VLSI, bioinformatics, field programmable gate arrays, multiprocessing systems, parallel architectures, reconfigurable architectures, 3D-stacked many-core architecture, CGRA, VLSI architecture, biological sequence analysis problems, coarse grained reconfigurable array, commodity FPGA, database search, memory subsystem, multiple sequence alignment, pairwise alignment, short read sequence mappings, Biology, Computational modeling, Computer architecture, Databases, Kernel, Sequences, Accelerator architectures, Application specific integrated circuits, Computational biology, Coprocessors, Three-dimensional integrated circuits
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-184234 (URN); 10.1109/SAMOS.2015.7363678 (DOI); 000380507900029 (); 2-s2.0-84963665644 (Scopus ID)
Conference
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on
Note

QC 20160405

Available from: 2016-03-30. Created: 2016-03-30. Last updated: 2016-09-05. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-0565-9376