  • 101. Khan, Mozammel H.
    et al.
    Hemani, Ahmed
    KTH, Superseded Departments, Electronic Systems Design.
    Tenhunen, Hannu
    KTH, Superseded Departments, Electronic Systems Design.
    Implementation-Independent Macrolibrary for Telecommunication in VHDL. 1996. In: Proceedings of the Baltic Electronics Conference, 1996, p. 291-294. Conference paper (Refereed)
  • 102.
    Kumar, Shashi
    et al.
    Indian Institute of Technology.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Ellervee, Peeter
    KTH, Superseded Departments, Electronic Systems Design.
    Hemani, Ahmed
    KTH, Superseded Departments, Electronic Systems Design.
    Kumar, Anshul
    KTH, Superseded Departments, Electronic Systems Design.
    Internal Representation for Specification and Design of Heterogeneous Systems. 1997. Conference paper (Refereed)
  • 103.
    Lansner, Anders
    et al.
    KTH, School of Computer Science and Communication (CSC), Computational Biology, CB.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Spiking brain models: Computation, memory and communication constraints for custom hardware implementation. 2014. In: 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), IEEE, 2014, p. 556-562. Conference paper (Refereed)
    Abstract [en]

    We estimate the computational capacity required to simulate in real time the neural information processing in the human brain. We show that the computational demands of a detailed implementation are beyond reach of current technology, but that some biologically plausible reductions of problem complexity can give performance gains between two and six orders of magnitude, which put implementations within reach of tomorrow's technology.

  • 104. Lazraq, T.
    et al.
    Svantesson, Bengt
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Hemani, Ahmed
    KTH, Superseded Departments, Electronic Systems Design.
    Modeling of Operation and Maintenance Functions in the ATM Network. 1995. In: Proceedings of ESM 1995, 1995. Conference paper (Refereed)
  • 105.
    Leung, Simon
    et al.
    Department of CSEE, University of Queensland.
    Postula, Adam
    Department of CSEE, University of Queensland.
    Hemani, Ahmed
    KTH, Superseded Departments, Microelectronics and Information Technology, IMIT.
    Test strategies on functionally partitioned module-based programmable architecture for base-band processing. 2001. In: Digital Systems, Design, 2001. Proceedings. Euromicro Symposium on, 2001, p. 326-333. Conference paper (Refereed)
    Abstract [en]

    A specialised reconfigurable architecture for telecommunication base-band processing is augmented with testing resources. The routing network is linked via virtual wire hardware modules to reduce the area occupied by connecting buses. The number of switches within the routing matrices is also minimised, which increases throughput without sacrificing flexibility. The testing algorithm was developed to systematically search for faults in the processing modules and the flexible high-speed routing network within the architecture. The testing algorithm starts by scanning the externally addressable memory space and testing the master controller. The controller then tests every switch in the route-through switch matrix by making loops from the shared memory to each of the switches. The local switch matrix is also tested in the same way. Next the local memory is scanned. Finally, pre-defined test vectors are loaded into local memory to check the processing modules. This algorithm scans all possible paths within the interconnection network exhaustively and reports all faults. Strategies can be inserted to bypass minor faults.

  • 106.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Chen, Guo
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A code reuse method for many-core coarse-grained reconfigurable architecture function library development. 2011. In: 2011 International Symposium on Integrated Circuits, ISIC 2011, 2011, p. 512-515. Conference paper (Refereed)
    Abstract [en]

    In this paper, a code reuse method is proposed to enhance the efficiency of function library development for many-core coarse-grained reconfigurable architectures. The method focuses on developing and using the precompiled ReConfigurable Functions (RCFs) in the function library. By applying this method to RCF development, functions are objectified like classes in an object-oriented programming language. Using a function means instantiating a selected RCF, and similar functions can be instantiated from the same RCF. Thus, the total number of RCFs to be compiled is reduced, the global programming efficiency is increased, and the labor requirement for application development is reduced.

  • 107.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Global control and storage synthesis for a system level synthesis approach. 2013. In: Proceedings - 21st Annual International IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2013, IEEE, 2013, p. 6546036. Conference paper (Refereed)
    Abstract [en]

    SYLVA is a System Level Architectural Synthesis Framework that translates Synchronous Data Flow (SDF) models of DSP sub-systems like modems and codecs into hardware implementation in ASIC/Standard Cells, FPGAs or CGRAs (Coarse Grain Reconfigurable Fabric).

  • 108.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Farahini, Nasim
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Rosvall, Kathrin
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Sander, Ingo
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    System level synthesis of hardware for DSP applications using pre-characterized function implementations. 2013. In: 2013 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), IEEE, 2013. Conference paper (Refereed)
    Abstract [en]

    SYLVA is a system level synthesis framework that transforms DSP sub-systems modeled as synchronous data flow into hardware implementations in ASIC, FPGAs or CGRAs. SYLVA synthesizes in terms of pre-characterized function implementations (FIMPs). It explores the design space in three dimensions: the number of FIMPs, the type of FIMPs, and the pipeline parallelism between the producing and consuming FIMPs. We introduce a timing and interface model of FIMPs to enable reuse and automatic generation of the Global Interconnect and Control (GLIC) that glues the FIMPs together into a working system. SYLVA has been evaluated by applying it to five realistic DSP applications. The results are analyzed for design space exploration, for efficacy in generating GLIC by comparison with manually generated GLIC, and for accuracy of design space exploration by comparing the area and energy costs considered during exploration, based on pre-characterized FIMPs, with the final results.

  • 109.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Accurate and efficient three level design space exploration based on constraints satisfaction optimization problem solver. 2014. In: Proceedings - 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014, 2014. Conference paper (Refereed)
  • 110.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Case Study: Constraint Programming in a System Level Synthesis Framework. 2014. In: PRINCIPLES AND PRACTICE OF CONSTRAINT PROGRAMMING, CP 2014, 2014, p. 846-861. Conference paper (Refereed)
    Abstract [en]

    This article presents a case study of using a constraint programming solver in a system level synthesis framework called SYLVA. The solver is used to find the repetition vector of a synchronous data flow graph and serves as the design space exploration engine, which rapidly finds qualified system implementations by solving a constraint satisfaction optimization problem. Each system implementation is a combination of a number of function implementation instances and their cycle accurate execution schedules. The problem to be solved is automatically generated based on the user inputs: 1) a system model to be synthesized, 2) a library containing all the usable function implementations, 3) the performance/cost constraints, and 4) the optimization objectives. Using constraint programming enabled low cost development of the design space exploration engine, in addition to improving ease of use.

  • 111.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Global interconnect and control synthesis in system level architectural synthesis framework. 2013. In: Proceedings - 16th Euromicro Conference on Digital System Design, DSD 2013, New York: IEEE, 2013, p. 11-17. Conference paper (Refereed)
    Abstract [en]

    In this paper, we describe the procedure of the Global Interconnect and Control (GLIC) synthesis step in a system level synthesis framework, which automatically generates GLIC logic from a scheduled SDF. The generated GLIC logic consists of control FSMs, interconnect and data buffers that glue existing function implementations together to construct the system modeled by the scheduled SDF. The experimental result shows that GLIC synthesis is able to generate compact control, interconnect and data buffers (5.7%, 0.6% and 0.9% of area usage for three examples implemented in 65nm ASIC) while saving a huge amount of manual effort and time (0.5s, 2.4s and 4.3s run time on a 2.8GHz x86 microprocessor for the three examples).

  • 112.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Memory allocation and optimization in system-level architectural synthesis. 2013. In: 2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip, ReCoSoC 2013, New York: IEEE, 2013, p. 6581537. Conference paper (Refereed)
    Abstract [en]

    In this paper, we present a novel approach to optimally allocate memory resources in a system-level synthesis flow, which converts a dataflow style system description (synchronous data flow) into a register-transfer level description in the specified implementation style (ASIC, FPGA or CGRA). The first problem encountered by the synthesis flow is that, since it covers different implementation styles, a generic model is required to support resource allocation and optimization. The second problem is the memory allocation method needed to optimally allocate memory resources in the RTL model. The contribution of this paper has two parts: 1) a generic memory model for different memory architectures in ASIC, FPGA and CGRA, and 2) a memory allocation and optimization method for optimally allocating storage elements in the intermediate representation with actual implementations (e.g. on-chip SRAM for ASIC, memory controller and off-chip SDRAM for FPGA). The memory allocation method is an implementation style dependent procedure and has three steps: architecture-independent optimization, resource allocation and architecture-dependent optimization. The experimental result shows that the proposed method is efficient and effective. The automatically generated implementation uses only approximately 4% more resources than a manual implementation. The fast and automatic memory allocation method enables fast design space exploration that requires little effort from the system designer.

  • 113.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Three-Dimensional Design Space Exploration for System Level Synthesis. 2014. In: 2014 17TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN (DSD), 2014, p. 419-426. Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose an efficient and effective three-dimensional design space exploration method for mapping a DSP system in synchronous data flow graph format onto an RTL or lower level hardware description using constraint programming. The three dimensions are 1) schedule level parallelism (the parallelism of the executions for one DSP function: fully parallel, semi-parallel or fully serial), 2) function level parallelism (how many function implementations are used to implement each of the DSP functions), and 3) arithmetic level parallelism (how the function implementations are implemented). The design space exploration problem is formulated as a constraint satisfaction optimization problem and solved by the constraint programming solver in Google's or-tools. The proposed method is compared against two state-of-the-art commercial HLS tools for four realistic examples and one synthetic example. The metrics compared are runtime, accuracy and quality of results in terms of resource usage. We show that, on average, the proposed method is 85.22% faster than the HLS tools, 4.3% more accurate and 8.27% better in quality of results. For the latter we have conservatively assumed the same function execution parallelism.

  • 114.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Jafari, Fahimeh
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kumar, Shashi
    Department of Electronics and Computer Engineering, School of Engineering, Jönköping University.
    Layered Spiral Algorithm for memory-aware mapping and scheduling on Network-on-Chip. 2010. In: 28th Norchip Conference, NORCHIP 2010, 2010. Conference paper (Refereed)
    Abstract [en]

    In this paper, the Layered Spiral Algorithm (LSA) is proposed for memory-aware application mapping and scheduling onto Network-on-Chip (NoC) based Multi-Processor System-on-Chip (MPSoC). The energy consumption is optimized while keeping high task level parallelism. The experimental evaluation indicates that if memory-awareness is not considered during mapping and scheduling, memory overflows may occur. The underlying problem is also modeled as a Mixed Integer Linear Programming (MILP) problem and solved using an efficient branch-and-bound algorithm to compare optimal solutions with the results achieved by LSA. Compared to the MILP solutions, the LSA results show only about 20% and 12% increase in total communication cost for a small and a medium size synthetic problem, respectively, while LSA is an order of magnitude faster than the MILP solutions. Therefore, LSA can find an acceptable total communication cost with low run-time complexity, enabling quick exploration of large design spaces, which is infeasible for exhaustive search.

  • 115.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Malik, Jamshaid Sarwar
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Liu, Shaoteng
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A code generation method for system-level synthesis on ASIC, FPGA and manycore CGRA. 2013. In: MES '13 Proceedings of the First International Workshop on Many-core Embedded Systems, ACM, 2013, p. 25-32. Conference paper (Refereed)
    Abstract [en]

    This paper presents a code generation method that translates an intermediate Register-Transfer Level (RTL) model of a system into its corresponding VHDL code for ASIC and FPGAs and MATLAB functions for manycore CGRAs. The intermediate representation consists of Function Implementations (FIMPs) and the glue logic. FIMPs are VHDL design units for the ASIC and FPGA implementation styles and MATLAB function templates for the CGRA implementation style, while the glue logic is a compact data structure storing Global Interconnect and Control (GLIC) information. The automatically generated implementation code increases the resource usage by 1.5% on average while reducing total design effort by two orders of magnitude.

  • 116.
    Li, Shuo
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Malik, Omer
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Automatic test program generation framework for NoC-based MPSoC compiler validation. 2011. In: 2011 International Conference on Instrumentation, Measurement, Circuits and Systems (ICIMCS 2011), vol 1: Instrumentation, Measurement, Circuits and Systems, New York: Amer Soc Mechanical Engineers, 2011, p. 99-103. Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose a systematic method (a framework) for automatic test program generation for Network-on-Chip (NoC) based Multi-Processor System-on-Chip (MPSoC) compiler validation. This framework consists of three parts: specification reader, program generator and platform simulator. By applying this framework, specified test programs for compiler validation are automatically generated together with their corresponding run time results. The validation productivity is enhanced and the expertise requirement is reduced. We also present an example tool called Automatic VESYLA Generator (AVG) implementing this framework. This tool is used in the Dynamic Reconfigurable Resource Array (DRRA) assembler development in our research group. The experiment shows that on a personal computer, the AVG tool generates bug-free test programs more than 100 times faster than a human programmer.

  • 117.
    Liu, Jia
    et al.
    KTH, School of Industrial Engineering and Management (ITM), Industrial Economics and Management (Dept.), Industrial Management.
    Li, Z.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Design of evaluation platform of machine vision for portable wireless terminals. 2011. Conference paper (Refereed)
    Abstract [en]

    An evaluation platform for machine vision algorithms is designed in this paper. The platform is constructed with a DM6437 DSP processor and image input-output circuit models. An image processing algorithm used for machine vision can be run on the platform. With a DFG model of the algorithm, the algorithm architecture can be built for convenient programming and analysis. As an example, an image segmentation algorithm has been modeled and executed on the platform. The result shows that the platform is useful for algorithm analysis and can be compared with other implementation systems as a design reference.

  • 118.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Ebrahim, Fatemeh Ostad
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Department of Computer Science and engineering, Indian Institute of Technology.
    A Coarse-Grained Reconfigurable Processor for Sequencing and Phylogenetic Algorithms in Bioinformatics. 2011. In: Proceedings: 2011 International Conference on Reconfigurable Computing and FPGAs, ReConFig 2011, 2011, p. 190-197. Conference paper (Refereed)
    Abstract [en]

    A coarse-grained reconfigurable processor tailored for accelerating multiple bioinformatics algorithms is proposed. In this paper, a programmable and scalable architectural platform instantiates an array of coarse grained light weight processing elements, which allows arbitrary partitioning and scheduling schemes and is capable of solving four complete, popular bioinformatics algorithms: Needleman-Wunsch, Smith-Waterman, and HMMER for sequencing, and Maximum Likelihood for phylogenetics. The key difference of the proposed CGRA based solution compared to FPGA and GPU based solutions is a much better match between architecture and algorithms for the core computational needs, as well as the system level architectural needs. For the same degree of parallelism, we provide 5X to 14X speed-up compared to FPGA solutions and 15X to 78X compared to GPU acceleration on 3 sequencing algorithms. We also provide 2.8X speed-up compared to FPGA with the same amount of core logic and 70X compared to GPU with the same silicon area for Maximum Likelihood.

  • 119.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A Coarse Grain Reconfigurable Architecture for sequence alignment problems in bio-informatics. 2010. In: Proceedings of the 2010 IEEE 8th Symposium on Application Specific Processors, SASP'10, 2010, p. 50-57. Conference paper (Refereed)
    Abstract [en]

    A Coarse Grain Reconfigurable Architecture (CGRA) tailored for accelerating bio-informatics algorithms is proposed. The key innovation is a light weight bio-informatics processor that can be reconfigured to perform different Add Compare and Select operations of the popular sequencing algorithms. A programmable and scalable architectural platform instantiates an array of such processing elements, allows arbitrary partitioning and scheduling schemes, and is capable of solving complete sequencing algorithms, including the sequential phases, and dealing with arbitrarily large sequences. The key difference of the proposed CGRA based solution compared to FPGA and GPU based solutions is a much better match of the architecture and algorithm for the core computational need as well as the system level architectural need. This claim is quantified for three popular sequencing algorithms: Needleman-Wunsch, Smith-Waterman and HMMER. For the same degree of parallelism, we provide 5X and 15X speed-ups compared to FPGA and GPU, respectively. For the same size of silicon, the advantage grows by another factor of 10X.

  • 120.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Paul, Kolin
    Indian Institute of Technology, Delhi, India.
    3D-stacked many-core architecture for biological sequence analysis problems. 2015. In: Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS), 2015 International Conference on, IEEE conference proceedings, 2015, p. 211-220. Conference paper (Refereed)
    Abstract [en]

    Sequence analysis plays a critical role in bioinformatics, and most of its applications have compute intensive kernels consuming over 70% of total execution time. By exploiting the compute intensive execution stages of popular sequence analysis applications, we present and evaluate a VLSI architecture with a focus on those that target biological sequences directly, including pairwise alignment, multiple sequence alignment, database search, and short read sequence mappings. Based on a coarse grained reconfigurable array (CGRA), we propose the use of many-core and 3D-stacked technologies to gain further improvement in the memory subsystem, which gives another order of magnitude speedup from high bandwidth and low access latency. We analyze our approach in terms of its throughput and efficiency for different application mappings. Initial experimental results are obtained from a stripped down implementation in a commodity FPGA, and then we scale the results to estimate the performance of our architecture with 9 layers of 68 mm² stacked wafers in a 45-nm process. We demonstrate estimated speedups of at least 39 times over existing hardware accelerators for the entire range of applications and datasets of interest. In comparison, the alternative FPGA based accelerators deliver improvement only for a single application, while GPGPUs do not perform well enough on accelerating program kernels with random memory access and integer addition/comparison operations.

  • 121.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Paul, Kolin
    Indian Institute of Technology, Delhi, India.
    A many-core hardware acceleration platform for short read mapping problem using distributed memory interface with 3D-stacked architecture. 2014. In: 2014 International Symposium on System-on-Chip, SoC 2014, 2014, p. 1-8. Conference paper (Refereed)
    Abstract [en]

    Next Generation Sequencing technologies produce huge amounts of short reads consisting of randomly fragmented DNA base pair strings, and assembling them poses a challenge for the mapping of short reads to a reference genome in terms of both sensitivity and execution time. In this paper, we propose a many-core hardware acceleration platform for short read mapping based on the hash-index method, which benefits from a distributed memory interface with 3D-stacked architecture for local memory access. Our design provides a 45012 times speedup over a software approach for single-end short reads and 21102 times for paired-end reads, and also outperforms a similar single-FPGA solution by 1466 times for single-end reads.

  • 122.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Department of Computer Science and engineering, Indian Institute of Technology.
    A reconfigurable processor for phylogenetic inference. 2011. In: VLSI Design (VLSI Design), 2011 24th International Conference on, IEEE, 2011, p. 226-231. Conference paper (Refereed)
    Abstract [en]

    A reconfigurable processor tailored for accelerating Phylogenetic Inference is proposed. In this paper, a programmable and scalable architectural platform instantiates an array of coarse grained light weight processing elements, which allows arbitrary partitioning and scheduling schemes and is capable of solving the complete Maximum Likelihood algorithm with arbitrarily large sequences. The key difference of the proposed CGRA based solution compared to FPGA and GPU based solutions is a much better match of the architecture and algorithm for the core computational need as well as the system level architectural need. For the same degree of parallelism, we provide a 2.27X speed-up compared to FPGA with the same amount of logic, and an 81.87X speed-up compared to GPU with the same silicon area.

  • 123.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Improved Bioinformatics Processing Unit for Multiple Applications. 2012. In: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2012, IEEE, 2012, p. 390-396. Conference paper (Refereed)
    Abstract [en]

    This paper presents a coarse-grain reconfigurable unit for accelerating multiple widely used bioinformatics algorithms. Our design is a highly efficient, programmable bioinformatics processing unit, called BiCell v2. Based on a specialized multimode multiplier, this unit provides three different working modes in order to accelerate four popular bioinformatics algorithms: Maximum Likelihood based phylogenetic inference, and Needleman-Wunsch, Smith-Waterman, and HMMER in sequence alignment. BiCell v2 supports both single and double precision floating-point computation, which significantly increases the accuracy of bioinformatics algorithm acceleration while retaining silicon area efficiency. Making use of this improved processing unit, our platform gives about 10X speedup compared to our previous design in single precision, and 23.2X speedup compared with GPGPU in the same precision.

  • 124.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Paul, Kolin
    Weis, Christian
    Jung, Matthias
    Wehn, Norbert
    3D-Stacked Many-Core Architecture for Biological Sequence Analysis Problems. 2017. In: International journal of parallel programming, ISSN 0885-7458, E-ISSN 1573-7640, Vol. 45, no 6, p. 1420-1460. Article in journal (Refereed)
    Abstract [en]

    Sequence analysis plays an extremely important role in bioinformatics, and most of its applications have compute intensive kernels consuming over 70% of total execution time. By exploiting the compute intensive execution stages of popular sequence analysis applications, we present and evaluate a VLSI architecture with a focus on those that target biological sequences directly, including pairwise sequence alignment, multiple sequence alignment, database search, and short read sequence mappings. Based on a coarse grained reconfigurable array, we propose the use of many-core and 3D-stacked technologies to gain further improvement in the memory subsystem, which gives another order of magnitude speedup from high bandwidth and low access latency. We analyze our approach in terms of its throughput and efficiency for different application mappings. Initial experimental results are obtained from a stripped down implementation in a commodity FPGA, and then we scale the results to estimate the performance of our architecture with 9 layers of stacked wafers in a 45-nm process. We demonstrate estimated speedups of at least 40 times over corresponding existing hardware accelerator platforms for the entire range of applications and datasets of interest. In comparison, the alternative FPGA based accelerators deliver improvement only for a single application, while GPGPUs do not perform well enough on accelerating program kernels with random memory access and integer addition/comparison operations.

  • 125.
    Liu, Pei
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Paul, Kolin
    Weis, Christian
    Jung, Matthias
    Wehn, Norbert
    A Customized Many-Core Hardware Acceleration Platform for Short Read Mapping Problems Using Distributed Memory Interface with 3D-Stacked Architecture. 2017. In: Journal of Signal Processing Systems, ISSN 1939-8018, E-ISSN 1939-8115, Vol. 87, no 3, p. 327-341. Article in journal (Refereed)
    Abstract [en]

    Rapidly developing Next Generation Sequencing technologies produce huge amounts of short reads consisting of randomly fragmented DNA base pair strings. Assembling those short reads poses a challenge for mapping reads to a reference genome in terms of both sensitivity and execution time. In this paper, we propose a customized many-core hardware acceleration platform for short read mapping problems based on the hash-index method. The processing core is highly customized to suit both 2-hit string matching and banded Smith-Waterman sequence alignment operations, while a distributed memory interface with 3D-stacked architecture provides high bandwidth and low access latency for highly customized dataset partitioning and memory access scheduling. Conforming to the original BFAST program, our design provides a 45,012 times speedup over the software approach for single-end short reads and 21,102 times for paired-end short reads, and it also beats a similar single-FPGA solution by 1466 times in the case of single-end reads. Optimized seed generation gives much better sensitivity while the performance boost remains impressive.

  • 126. Malik, J. S.
    et al.
    Ben Slimane, Slimane
    KTH, School of Information and Communication Technology (ICT), Communication Systems, CoS, Radio Systems Laboratory (RS Lab).
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Gohar, N. D.
    Improving performance of fading channel simulators by use of Uniformly distributed Random Numbers. 2011. In: 2010 IEEE International Symposium on Signal Processing and Information Technology, 2011, p. 91-96. Conference paper (Refereed)
    Abstract [en]

    Filter-based fading channel simulators universally use White Gaussian Noise (WGN) to generate complex tap coefficients. In this work, we will show that replacing the WGN source with a Uniform Random Number Generator (URNG) improves simulation speed in a software simulator and reduces area and power in a hardware simulator. We will verify, both analytically and through extensive simulations, that the use of a URNG does not degrade important simulator performance parameters such as statistical properties, spectral shape, level crossing rate and bit error rate. To validate our analysis, we have designed fading channel simulators in both software and hardware. We will show that the use of a URNG yields a 6 percent improvement in simulation time for the software simulator. Similarly, we will demonstrate that the hardware simulator obtains improvements of 30 and 40 percent in area and power consumption, respectively.

  • 127.
    Malik, Jamshaid
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Malik, J. N.
    NUST.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Gohar, N. D.
    NUST.
    An efficient hardware implementation of high quality AWGN generator using Box-Muller method. 2011. In: 11th International Symposium on Communications and Information Technologies, ISCIT 2011, 2011, p. 449-454. Conference paper (Refereed)
    Abstract [en]

    The Box-Muller (BM) algorithm is extensively used for generation of high quality Gaussian Random Numbers (GRNs) in hardware. The most efficient published implementation of the BM method transforms the 32-bit data path to 16 bits and uses first degree piece-wise polynomial approximation to compute the logarithmic and square root functions. In this work, we have performed extensive error analysis to show that the coefficient memory for polynomial approximation can be reduced by more than 35 percent without compromising the quality of the generated Gaussian samples. This also reduces the complexity of the corresponding address generator, which requires the most hardware resources. We have also used more efficient and statistically accurate skip-ahead Linear Feedback Shift Registers to generate uniformly distributed numbers for the BM algorithm. The complete hardware implementation utilizes only 407 slices, 3 DSP blocks and 1.5 memory blocks on a Xilinx Virtex-4 XC4VLX15 operating at 230 MHz while providing a tail accuracy of 6.6σ. This is better in terms of accuracy and hardware utilization than any of the previously reported architectures.
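
    For orientation, the basic Box-Muller transform named in this abstract can be sketched in software as below. This is only a floating-point illustration of the textbook transform, not the paper's fixed-point hardware design: the function names `box_muller` and `gaussian_samples` are hypothetical, and Python's built-in generator stands in for the skip-ahead LFSRs mentioned in the abstract.

```python
import math
import random


def box_muller(u1, u2):
    """Map two uniforms (u1 in (0, 1], u2 in [0, 1)) to two N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))  # the logarithm and square root stages
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def gaussian_samples(count, seed=0):
    """Return `count` Gaussian samples from a seeded uniform source (assumed stand-in)."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) stays finite
        u2 = rng.random()
        samples.extend(box_muller(u1, u2))
    return samples[:count]
```

    Calling `gaussian_samples(1000)` returns a list of 1000 floats whose sample mean and variance should be close to 0 and 1.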

  • 128.
    Malik, Jamshaid
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Malik, J. N.
    NUST Pakistan.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Gohar, N. D.
    NUST Pakistan.
    Generating high tail accuracy Gaussian Random Numbers in hardware using central limit theorem. 2011. In: 2011 IEEE/IFIP 19th International Conference on VLSI and System-on-Chip, VLSI-SoC 2011, 2011, p. 60-65. Conference paper (Refereed)
    Abstract [en]

    An efficient hardware implementation of a Gaussian Random Number (GRN) generator based on the Central Limit Theorem (CLT) is presented. The CLT, although very simple to implement, is never used to generate high quality Gaussian numbers. This is because a direct implementation of the CLT provides very poor accuracy in the tail regions of the probability density function. In this work, we have shown that it is possible to achieve high tail accuracy by empirically computing the error in the CLT, which can be compensated with a simple correction algorithm. The error has been modeled as a first degree piece-wise polynomial approximation, using a novel non-uniform segmentation algorithm to compute the coefficients of the polynomial segments. A novel hardware architecture for the GRN generator is presented which requires only 420 slices and 1 DSP block of a Xilinx Virtex-4 XC4VLX15 operating at 220 MHz. This resource utilization is better than any of the previously reported designs. Demonstrated for a tail accuracy of 6σ, the GRN generator design is scalable to achieve even higher accuracy with minimal increase in hardware resources. The accuracy of the GRN generator is validated using statistical goodness of fit tests.
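
    The direct CLT method that this abstract starts from can be sketched as follows: sum a fixed number of uniform samples, then subtract the mean and divide by the standard deviation of that sum. This is a minimal illustration under stated assumptions, not the paper's architecture: the name `clt_gaussian` and the default of 12 terms are hypothetical choices, and the paper's error-correction step is not reproduced.

```python
import random


def clt_gaussian(rng, terms=12):
    """Approximate one N(0, 1) sample by summing `terms` uniforms on [0, 1)."""
    total = sum(rng.random() for _ in range(terms))
    mean = terms / 2.0             # each uniform has mean 1/2
    std = (terms / 12.0) ** 0.5    # each uniform has variance 1/12
    return (total - mean) / std


rng = random.Random(0)
samples = [clt_gaussian(rng) for _ in range(10000)]
```

    With 12 terms the output can never leave the interval [-6, 6], and the density near those limits falls below the true Gaussian, which is the poor tail accuracy the abstract refers to.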

  • 129.
    Malik, Jamshaid Sarwar
    et al.
    KTH.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Gaussian Random Number Generation: A Survey on Hardware Architectures. 2016. In: ACM Computing Surveys, ISSN 0360-0300, E-ISSN 1557-7341, Vol. 49, no 3, article id 53. Article in journal (Refereed)
    Abstract [en]

    Some excellent surveys of Gaussian random number generators (GRNGs) from the algorithmic perspective exist in the published literature to date (e.g., Thomas et al. [2007]). In the last decade, however, advancements in digital hardware have resulted in an ever-decreasing hardware cost and increased design flexibility. Additionally, recent advances in applications like gaming, weather forecasting, and simulations in physics and astronomy require faster, cheaper, and statistically accurate GRNGs. These two trends have contributed toward the development of a number of novel GRNG architectures optimized for hardware design. A detailed comparative study of these hardware architectures has been somewhat missing in the published literature. This work provides the potential user with a capsulization of the published hardware GRNG architectures. We have provided the method and theory, pros and cons, and a comparative summary of the speed, statistical accuracy, and hardware resource utilization of these architectures. Finally, we have complemented this work by describing two novel hardware GRNG architectures, namely the CLT-inversion and the multihat algorithm. These new architectures provide high tail accuracy (6 sigma and 8 sigma, respectively) at a low hardware cost.

  • 130.
    Malik, Jamshaid Sarwar
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    On the Design of Doppler Filters for Next Generation Radio Channel Simulators. 2009. In: 2009 3RD INTERNATIONAL CONFERENCE ON SIGNALS, CIRCUITS AND SYSTEMS (SCS 2009), NEW YORK: IEEE, 2009, p. 746-751. Conference paper (Refereed)
    Abstract [en]

    Real-time wireless channel simulators are necessary for radio prototyping. The Doppler filter is one of the basic building blocks in correlation-based channel simulators. The enormous computational complexity of channel models for new wireless standards like MIMO prohibits their software realization (which has traditionally been the case). In the first part of this work, we dimension and compare two alternative implementations of the Doppler filter, one using FIR and the other using IIR. Next, we provide hardware implementations to determine the area and power requirements of Doppler filters for channels as complicated as 10 x 10 MIMO. We conclude that 5th-order IIR filters with a 32-bit fixed-point MAC provide near-optimum accuracy, area and power consumption and are a logical choice for hardware implementations of wireless channel simulators. We also provide an FPGA implementation of our design to indicate its strength and potential role in future simulators. Finally, we extrapolate our results to indicate future trends in wireless channel simulator implementation.

  • 131.
    Malik, Jamshaid Sarwar
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Gohar, N. D.
    National University of Sciences and Technology, SEECS, Pakistan.
    Unifying CORDIC and Box-Muller algorithms: An accurate and efficient Gaussian Random Number generator. 2013. In: Proceedings Of The 2013 IEEE 24th International Conference On Application-Specific Systems, Architectures And Processors (ASAP 13), IEEE Computer Society, 2013, p. 277-280. Conference paper (Refereed)
    Abstract [en]

    An efficient hardware implementation of a Gaussian Random Number (GRN) generator based upon the Box-Muller (BM) and CORDIC algorithms is presented. We will illustrate a novel hardware architecture with a flexible design space that unifies the two algorithms. A major advantage of this work is that, unlike any of the previously reported architectures, it is possible to eliminate hardware multipliers and memory blocks in the synthesized hardware. This is achieved without compromising the statistical accuracy of the GRN generators, which is proved both through error analysis and standard tests. We will also demonstrate two different hardware implementations that vary in terms of speed, tail accuracy (4.7σ to 9.4σ), and utilization of hardware resources such as DSP blocks, logic slices and memory blocks on FPGAs. Finally, we will present a comparison of the designed architectures with previously published hardware GRN generators.

  • 132.
    Malik, Jamshaid Sarwar
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Malik, J. N.
    Slimane, Ben
    KTH, School of Information and Communication Technology (ICT), Communication Systems, CoS, Radio Systems Laboratory (RS Lab).
    Gohar, N. D.
    Revisiting central limit theorem: Accurate Gaussian random number generation in VLSI. 2015. In: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, ISSN 1063-8210, E-ISSN 1557-9999, Vol. 23, no 5, p. 842-855, article id 6834810. Article in journal (Refereed)
    Abstract [en]

    Gaussian random numbers (GRNs) generated by the central limit theorem (CLT) suffer from errors due to deviation from ideal Gaussian behavior for any finite number of additions. In this paper, we will show that it is possible to compensate for the error in the CLT, thereby correcting the resultant probability density function, particularly in the tail regions. We will provide a detailed mathematical analysis to quantify the error in the CLT. This provides a design space with more than four degrees of freedom to build a variety of GRN generators (GRNGs). A framework utilizes this design space to generate customized hardware architectures. We will demonstrate designs of five different GRNG architectures, which vary in terms of consumed memory, logic slices, and multipliers on a field-programmable gate array. Similarly, depending upon the application, these architectures exhibit statistical accuracy from low (4σ) to extremely high (12σ). A comparison with previously published designs clearly indicates the advantages of this methodology in terms of both consumed hardware resources and accuracy. We will also provide synthesis results of the same designs as an application-specific integrated circuit using a 65-nm standard cell library. Finally, we will highlight some shortcomings associated with such architectures, followed by their remedies.

  • 133.
    Malik, Jamshaid Sarwar
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Palazzari, Paolo
    PLDA Italia, Rome, Italy.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Effort, resources, and abstraction vs performance in high-level synthesis: finding new answers to an old question. 2012. In: SIGARCH Computer Architecture News, ISSN 0163-5964, E-ISSN 1943-5851, Vol. 40, no 5, p. 64-69. Article in journal (Refereed)
    Abstract [en]

    This work provides new perspectives on the impact of design effort, consumed resources and design abstraction on hardware performance in a high-level synthesis flow. We have shown that, counter to published literature as well as intuition, more design effort may not always result in better performance. We developed a kernel that simulates Brownian motion, and investigated improvement in hardware performance with design effort at various abstraction levels. Our results indicate that a designer should be careful in putting more effort at a particular abstraction level. In our case, we achieved the best performance/effort ratio at algorithm level rather than lower abstraction levels. This strongly suggests that design effort is not always proportional to the corresponding improvement in performance.

  • 134.
    Malik, Jamshaid
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Slimane, Ben
    KTH, School of Information and Communication Technology (ICT), Communication Systems, CoS.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Gohar, N. D.
    NUST Pakistan.
    Impact of Interpolation Techniques on Statistical Properties and Speed of Fading Channel Simulators. 2010. In: 6th International Conference on Wireless and Mobile Communications, ICWMC 2010, 2010, p. 117-124. Conference paper (Refereed)
    Abstract [en]

    Interpolation filters are considered to be computationally intensive sections of fading channel simulators. They can be implemented by using linear interpolation or zero-padding followed by low-pass IIR or polyphase filtering. In this work, we will investigate how different interpolation techniques affect the statistical properties and speed of the simulator. We will show that the use of linear interpolation results in a 3 to 6 times improvement in simulation speed with negligible degradation in the desired statistical accuracy. We will validate this claim by designing a fading channel simulator that uses the three above-mentioned interpolation techniques and observing their impact on its first order (probability density functions) and second order (correlation functions) statistical properties. We will also compare the impact of the interpolation techniques on the level crossing rate (LCR) and bit error rate (BER) of the fading signal. Finally, we will reinforce our claim by using an advanced multiple-tap channel model (gsmTUX6C1) with our simulator (using linear interpolation) and showing that its performance is comparable to the corresponding Matlab model that uses polyphase interpolation. We will conclude this work with a recommendation to use linear interpolation for efficient and statistically correct fading channel simulators.

  • 135.
    Malik, Omer
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A pragma based approach for mapping MATLAB applications on a coarse grained reconfigurable architecture. 2012. In: Proceedings - SBCCI 2012: 25th Symposium on Integrated Circuits and Systems Design, IEEE, 2012, p. 1-6. Conference paper (Refereed)
    Abstract [en]

    This paper describes a tool that maps DSP functions written in MATLAB on a coarse grained reconfigurable platform with the help of pragmas. Pragmas are architectural hint directives that constrain the design implementation in terms of allocation/binding and explore specific features of the platform. The pragmas are symbolic and parametric to make DSP functions generic in terms of dimension. By sweeping these parameters, the tool can generate solutions of different dimensions and varying degrees of parallelism from the same algorithmic level code. The tool performs scheduling, synchronization, control and interconnect generation for each such solution. This enables the pragma-annotated functions to serve as generic library elements in a system level synthesis framework that sweeps through the parameters to select a function implementation of the right size and optimal parallelism, thus enabling efficient system level design space exploration. In this paper, we describe these pragmas, their syntax and semantics, and richly illustrate their usage with examples.

  • 136.
    Malik, Omer
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Synchronizing distributed state machines in a coarse grain reconfigurable architecture. 2011. In: 2011 International Symposium on System on Chip, SoC 2011, 2011, p. 128-135. Conference paper (Refereed)
    Abstract [en]

    This work presents a methodology for synchronizing distributed FSMs (Finite State Machines) that are generated while implementing different algorithms on a coarse grain reconfigurable architecture. These FSMs interact with and depend upon each other while executing algorithms; thus they need to be synchronized with each other for correct execution. The algorithms presented in this paper make appropriate use of the different strategies available for synchronizing these FSMs. The tool hides all low level details from the programmer. It lets the designer focus on the details of the algorithm (at a higher level of abstraction), while cycle-by-cycle timings are resolved automatically.

  • 137.
    Malik, Omer
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A Library Development Framework for a Coarse Grain Reconfigurable Architecture. 2011. In: VLSI Design (VLSI Design), 2011 24th International Conference on, 2011, p. 153-158. Conference paper (Refereed)
    Abstract [en]

    A framework for efficiently capturing the rich microarchitectural space of a substantial MATLAB-like library of DSP functions for a regular Coarse Grain Reconfigurable Architecture (CGRA) fabric is proposed. A subset of C has been proposed to model the DSP functions, and an automatic tool to generate the configware for the CGRA fabric has been developed. A method to estimate the average energy of such functions is reported with an error margin of less than 3%. Such a framework is proposed as the basis for raising the abstraction to automate synthesis of the entire physical layers.

  • 138.
    Malik, Omer
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Shami, Muhammad Ali
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    High Level Synthesis Framework for a Coarse Grain Reconfigurable Architecture. 2010. In: 28th Norchip Conference, NORCHIP 2010, 2010, p. 5669439-. Conference paper (Refereed)
    Abstract [en]

    A High Level Synthesis Framework for mapping DSP algorithms on a Coarse Grain Reconfigurable Architecture is presented. The behavioral specification of the algorithm in C is annotated with pragmas in comments, and the tool generates configware after performing timing and synchronization synthesis. Pragmas identify SIMD type concurrency and sweep the architectural space with allocation and binding annotations to produce implementations from fully serial to fully parallel. This allows the user to stay at the algorithmic level and guide the HLS tool to search a restricted architectural space bounded by the pragmas, thus making the synthesis process more efficient and predictable.

  • 139. Ngyen, T.
    et al.
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Turku Centre for Computer Science, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Dytckov, Sergei
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    FIST: A framework to interleave spiking neural networks on CGRAs. 2015. In: Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015, IEEE, 2015, p. 751-758. Conference paper (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern embedded applications. In many application domains (e.g. robotics and cognitive embedded systems), the CGRAs are required to simultaneously host processing (e.g. audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. Recent works have revealed that the efficiency and scalability of the estimation algorithms can be significantly improved by using neural networks. However, existing CGRAs commonly employ homogeneous processing resources for both tasks. To realize the best of both worlds (conventional processing and neural networks), we present FIST. FIST allows the processing elements and the network to dynamically morph into either a conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.

  • 140.
    Nidhi, U.
    et al.
    Indian Institute of Technology, Delhi, India.
    Paul, Kolin
    Indian Institute of Technology, Delhi, India.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Kumar, Anshul
    Indian Institute of Technology, Delhi, India.
    High performance 3D-FFT implementation. 2013. In: Circuits and Systems (ISCAS), 2013 IEEE International Symposium on, IEEE, 2013, p. 2227-2230. Conference paper (Refereed)
    Abstract [en]

    3D FFT is a very data- and compute-intensive kernel encountered in many applications. We report a high performance design and implementation of 3D-FFT on a CGRA which supports partial reconfiguration. The hardware-software multi-clock design uses dynamic reconfiguration to reduce the required communication bandwidth and achieves a sustained throughput of 40 GOPS on a word size of 48 bits. Performance metrics, including overheads and speedup over software for implementations of up to 256 point 3D-FFT, are presented in the paper.

  • 141.
    O’Nils, Mattias
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Hemani, Ahmed
    KTH, Superseded Departments, Electronic Systems Design.
    Tenhunen, Hannu
    KTH, Superseded Departments, Electronic Systems Design.
    Interactive Hardware-Software Partitioning and Memory Allocation Based on Data Transfer Profiling. 1995. In: Proceeding of International Conference on Recent Advances in Mechatronics, 1995. Conference paper (Refereed)
  • 142.
    O’Nils, Mattias
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Tammemäe, Kalle
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Hemani, Ahmed
    KTH, Superseded Departments, Electronic Systems Design.
    Design of D-AMPS Channel Decoder with Codesign Methodologies. 1996. In: Proceedings of the Baltic Electronics Conference, 1996, p. 397-400. Conference paper (Refereed)
  • 143.
    O’Nils, Mattias
    et al.
    KTH, Superseded Departments, Electronic Systems Design.
    Tammemäe, Kalle
    KTH, Superseded Departments, Electronic Systems Design.
    Jantsch, Axel
    KTH, Superseded Departments, Electronic Systems Design.
    Hemani, Ahmed
    KTH, Superseded Departments, Electronic Systems Design.
    Tenhunen, Hannu
    KTH, Superseded Departments, Electronic Systems Design.
    Experiences using Akka: A Hardware-Software Codesign Tool Kit in design of Telecommunication systems. 1995. Conference paper (Refereed)
  • 144.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Badawi, Mohammad
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A Step Beyond TLM: Inferring Architectural Transactions at Functional Untimed Level. 2008. In: IFIP/IEEE VLSI-SoC 2008 International Conference: 16th International Conference on Very Large Scale Integration, 2008, p. 505-509. Conference paper (Other academic)
  • 145.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Bolognino, Luca
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Energy and Performance Model of a SPARC Leon3 Processor. 2009. In: PROCEEDINGS OF THE 2009 12TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN, ARCHITECTURES, METHODS AND TOOLS, LOS ALAMITOS: IEEE COMPUTER SOC, 2009, p. 651-656. Conference paper (Refereed)
    Abstract [en]

    We present a general methodology to implement a processor energy model, based on instruction-level characterization, and we apply it to a SPARC-based Leon3 processor. The model is characterized by simulating a back-annotated gate-level netlist and has two levels of accuracy: a coarse-grain estimation based on characterizing each single instruction, and a fine-grain estimation that accounts for the impact of instruction interdependency on energy and is based on characterizing pairs of instructions together. Our investigation also takes into account the effect that both data switching activity and register correlation have on energy. We validate our model by applying it to a set of instruction traces generated by Instruction Set Simulation and comparing it to energy extracted directly from gate level. We achieve a worst-case error of approximately 12% and a speedup higher than 1000 times.

  • 146.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A layered approach to estimating power consumption. 2006. In: 24th Norchip Conference, Proceedings / [ed] Johansson, T, IEEE, 2006, p. 93-98. Conference paper (Refereed)
    Abstract [en]

    A layered approach to estimating power consumption at the highest level of abstraction is presented. This approach is sufficiently accurate and fast to be used as a guide for exploring the algorithmic and architectural space. The layers span from use-case level down to gate level. Speed and accuracy come from our ability to relate parameterized transactions at architectural level to switching activity at gate level and to perform architecturally-aware application-level simulation for specific use-cases or sweeps of use-cases. That enables us to accurately recreate architectural-level transactions. Additionally, we use a preliminary floorplan to factor in physical design aspects and improve the accuracy of our estimates. We base our work on the industry standard SPIRIT for specifying IPs and Platforms. Early results of the work are also presented.

  • 147.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Badawi, Mohammad
    Modelling Embedded Systems at Functional Untimed Application Level. 2007. In: IP Conference (IP’07), 2007, p. 107-112. Conference paper (Other academic)
  • 148.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    Bolognino, Luca
    KTH, School of Information and Communication Technology (ICT), Electronic, Computer and Software Systems, ECS.
    A General Approach to High-Level Energy and Performance Estimation in SoCs. 2009. In: 22ND INTERNATIONAL CONFERENCE ON VLSI DESIGN HELD JOINTLY WITH 8TH INTERNATIONAL CONFERENCE ON EMBEDDED SYSTEMS, PROCEEDINGS, 2009, p. 200-205. Conference paper (Refereed)
    Abstract [en]

    We present a high-level methodology for efficient and accurate estimation of energy and performance in SoCs at Functional Untimed Level. We then validate the proposed method against gate level for accuracy and against TLM-PV for speed. We show that the method is within 15% of gate-level accuracy and on average 28x faster than TLM-PV, for the benchmark applications selected.

  • 149.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Bolognino, Luca
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    A general approach to high-level energy and performance estimation in system-on-chip architectures. 2009. In: Journal of Low Power Electronics, ISSN 1546-1998, Vol. 5, no 3, p. 373-384. Article in journal (Refereed)
    Abstract [en]

    We present a high-level methodology for efficient and accurate estimation of energy and performance in SoCs. Unlike the most common approaches, which rely on Transaction-Level Modeling (TLM), we infer energy and performance figures directly from the Functional Untimed Level, by running the algorithmic specification natively on a common host machine. We then validate the proposed method against gate level for accuracy and against TLM-PV for speed. We show that the method is within 17% of gate-level accuracy and on average 28x faster than TLM-PV, for the benchmark applications selected.

  • 150.
    Penolazzi, Sandro
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Sander, Ingo
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Inferring energy and performance cost of RTOS in priority-driven scheduling. 2010. In: 5th International Symposium on Industrial Embedded Systems, SIES 2010, 2010, p. 1-8. Conference paper (Other academic)