Change search
Refine search result
12 1 - 50 of 63
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Rows per page
  • 5
  • 10
  • 20
  • 50
  • 100
  • 250
Sort
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Disputation date (earliest first)
  • Disputation date (latest first)
  • Standard (Relevance)
  • Author A-Ö
  • Author Ö-A
  • Title A-Ö
  • Title Ö-A
  • Publication type A-Ö
  • Publication type Ö-A
  • Issued (Oldest first)
  • Issued (Newest first)
  • Created (Oldest first)
  • Created (Newest first)
  • Last updated (Oldest first)
  • Last updated (Newest first)
  • Disputation date (earliest first)
  • Disputation date (latest first)
Select
The maximal number of hits you can export is 250. When you want to export more records please use the Create feeds function.
  • 1. Akbari, N.
    et al.
    Modarressi, M.
    Daneshtalab, Masoud
    KTH.
    Loni, Efisio
    KTH.
    A Customized Processing-in-Memory Architecture for Biological Sequence Alignment2018In: Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors, Institute of Electrical and Electronics Engineers Inc. , 2018, Vol. 2018, article id 8445124Conference paper (Refereed)
    Abstract [en]

    Sequence alignment is the most widely used operation in bioinformatics. With the exponential growth of the biological sequence databases, searching a database to find the optimal alignment for a query sequence (that can be at the order of hundreds of millions of characters long) would require excessive processing power and memory bandwidth. Sequence alignment algorithms can potentially benefit from the processing power of massive parallel processors due their simple arithmetic operations, coupled with the inherent fine-grained and coarse-grained parallelism that they exhibit. However, the limited memory bandwidth in conventional computing systems prevents exploiting the maximum achievable speedup. In this paper, we propose a processing-in-memory architecture as a viable solution for the excessive memory bandwidth demand of bioinformatics applications. The design is composed of a set of simple and lightweight processing elements, customized to the sequence alignment algorithm, integrated at the logic layer of an emerging 3D DRAM architecture. Experimental results show that the proposed architecture results in up to 2.4x speedup and 41% reduction in power consumption, compared to a processor-side parallel implementation.

  • 2. Aldinucci, Marco
    et al.
    Brorsson, Mats
    KTH, School of Information and Communication Technology (ICT), Software and Computer systems, SCS.
    D'Agostino, Daniele
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics, Electronic and embedded systems.
    Kilpatrick, Peter
    Leppanen, Ville
    Preface2017In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 31, no 3, p. 179-180Article in journal (Refereed)
  • 3. Anwar, Hassan
    et al.
    Jafri, Syed Mohammad Asad Hassan
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Sergei, Dytckov
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Exploring Spiking Neural Network on Coarse-Grain Reconfigurable Architectures2014In: ACM International Conference Proceeding Series, 2014, p. 64-67Conference paper (Refereed)
    Abstract [en]

    Today, reconfigurable architectures are becoming increas- ingly popular as the candidate platforms for neural net- works. Existing works, that map neural networks on re- configurable architectures, only address either FPGAs or Networks-on-chip, without any reference to the Coarse-Grain Reconfigurable Architectures (CGRAs). In this paper we investigate the overheads imposed by implementing spiking neural networks on a Coarse Grained Reconfigurable Ar- chitecture (CGRAs). Experimental results (using point to point connectivity) reveal that up to 1000 neurons can be connected, with an average response time of 4.4 msec.

  • 4. Bakhouya, Mohamed
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Palesi, Maurizio
    Ghasemzadeh, Hassan
    Many-core System-on-Chip: architectures and applications2016In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 43, p. 1-3Article in journal (Refereed)
  • 5.
    Dalarsson, Masoud
    et al.
    KTH, School of Electrical Engineering (EES), Electromagnetic Engineering.
    Mittra, R.
    Analytical approach to modeling flat lenses with continuously graded profiles2015In: 2015 USNC-URSI Radio Science Meeting (Joint with AP-S Symposium), USNC-URSI 2015 - Proceedings, IEEE , 2015Conference paper (Refereed)
    Abstract [en]

    In this work we present an analytical approach to deriving the field solutions for a class of flat lenses that have attracted the attentions of antenna designers and researchers alike. The lens designs typically consist of a number of layers of graded index dielectrics, whose properties may vary in both the radial and longitudinal directions. The fields propagating in the longitudinal direction through the central layer primarily contribute to the bulk of the phase, while the side layers act as matching layers and help reduce the reflections originating at the interfaces of the middle layer. We model such lenses as compact composites with material properties characterized by continuous permittivity and permeability functions, which tend asymptotically to unity at the boundaries of the composite cylinder.

  • 6.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Bagherzadeh, Nader
    Sarbazi-Azad, Hamid
    On-chip parallel and network-based systems Preface2015In: Integration, ISSN 0167-9260, E-ISSN 1872-7522, Vol. 50, p. 137-138Article in journal (Refereed)
  • 7.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Bagherzadeh, Nader
    Sarbazi-Azad, Hamid
    Special issue on on-chip parallel and network-based systems2015In: Computing, ISSN 0010-485X, E-ISSN 1436-5057, Vol. 97, no 6, p. 539-541Article in journal (Other academic)
  • 8.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Turku, Finland.
    Ebrahimi, Masoumeh
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Turku, Finland.
    Dytckov, Sergei
    Plosila, Juha
    In-order delivery approach for 2D and 3D NoCs2015In: Journal of Supercomputing, ISSN 0920-8542, E-ISSN 1573-0484, Vol. 71, no 8, p. 2877-2899Article in journal (Refereed)
    Abstract [en]

    In many applications, it is critical to guarantee the in-order delivery of requests from the master cores to the slave cores, so that the requests can be executed in the correct order without requiring buffers. Since in NoCs packets may use different paths and on the other hand traffic congestion varies on different routes, the in-order delivery constraint cannot be met without support. To guarantee the in-order delivery, traditional approaches either use dimension-order routing or employ reordering buffers at network interfaces. Dimension-order routing degrades the performance considerably while the usage of reordering buffers imposes large area overhead. In this paper, we present a mechanism allowing packets to be routed through multiple paths in the network, helping to balance the traffic load while guaranteeing the in-order delivery. The proposed method combines the advantages of both deterministic and adaptive routing algorithms. The simple idea is to use different deterministic algorithms for independent flows. This approach neither requires reordering buffers nor limits packets to use a single path. The algorithm is simple and practical with negligible area overhead over dimension-order routing. The concept is investigated in both 2D and 3D mesh networks.

  • 9.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Mehdipour, Farhad
    Yu, Zhiyi
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Special Issue on Emerging Many-Core Systems for Exascale Computing2015In: ACM Journal on Emerging Technologies in Computing Systems, ISSN 1550-4832, Vol. 11, no 4, article id 39Article in journal (Other academic)
  • 10.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Palesi, M.
    Message from the chairs2016Conference proceedings (editor) (Refereed)
  • 11.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Palesi, M.
    Sonntag, S.
    Angiolini, F.
    Message from the chairs2015In: ACM International Conference Proceeding, ACM Press, 2015, Vol. 13-17-June-2015Conference paper (Refereed)
  • 12.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland .
    Palesi, Maurizio
    Mak, Terrence
    Introduction to the Special Issue on Network-on-Chip Architectures2014In: Computers & electrical engineering, ISSN 0045-7906, E-ISSN 1879-0755, Vol. 40, no 8, p. 257-259Article in journal (Other academic)
  • 13.
    Daneshtalab, Masoud
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Palesi, Maurizio
    Mak, Terrence
    Introduction to the special issue on NoC-based many-core architectures2015In: Computers & electrical engineering, ISSN 0045-7906, E-ISSN 1879-0755, Vol. 45, p. 359-361Article in journal (Other academic)
  • 14. Dytckov, S.
    et al.
    Purohit, S. S.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Plosila, J.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Exploring NoC jitter effect on simulation of spiking neural networks2014In: Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, 2014, p. 693-696Conference paper (Refereed)
    Abstract [en]

    The major bottleneck in simulation of large-scale neural networks is the communication problem due to one-to-many neuron connectivity. Network-on-Chip concept has been proposed to address the problem. This work explores the drawback that is introduced by interconnection networks - a delay jitter. The preliminary experiment is held in the spiking neural network simulator introducing variable communicational delay to the simulation. The performance degradation is reported.

  • 15.
    Dytckov, Sergei
    et al.
    University of Turku, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Ebrahimi, Masoumeh
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics. University of Turku, Finland.
    Anwar, Hassan
    Ecole Polytechnique Montreal, Canada.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics. University of Turku, Finland.
    Efficient STDP Micro-Architecture for Silicon Spiking Neural Networks2014In: 2014 17TH EUROMICRO CONFERENCE ON DIGITAL SYSTEM DESIGN (DSD), 2014, p. 496-503Conference paper (Refereed)
    Abstract [en]

    Spiking neural networks (SNNs) are the closest approach to biological neurons in comparison with conventional artificial neural networks (ANN). SNNs are composed of neurons and synapses which are interconnected with a complex pattern. As communication in such massively parallel computational systems is getting critical, the network-on-chip (NoC) becomes a promising solution to provide a scalable and robust interconnection fabric. However, using NoC for large-scale SNNs arises a trade-off between scalability, throughput, neuron/router ratio (cluster size), and area overhead. In this paper, we tackle the trade-off using a clustering approach and try to optimize the synaptic resource utilization. An optimal cluster size can provide the lowest area overhead and power consumption. For the learning purposes, a phenomenon known as spike-timing-dependent plasticity (STDP) is utilized. The micro-architectures of the network, clusters, and the computational neurons are also described. The presented approach suggests a promising solution of integrating NoCs and STDP-based SNNs for the optimal performance based on the underlying application.

  • 16.
    Ebrahimi, Masoumeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Chang, Xin
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Plosila, Juha
    Univ Turku, Dept Informat Technol, SF-20500 Turku, Finland..
    In-Order Delivery Approach for 3D NoCs2013In: 2013 17TH CSI INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE AND DIGITAL SYSTEMS (CADS 2013), IEEE , 2013, p. 87-+Conference paper (Refereed)
    Abstract [en]

    Routing algorithms can be classified into deterministic and adaptive methods. In deterministic methods, a single path is selected for each pair of source and destination nodes, and thus they are unable to distribute the traffic load over the network. Using deterministic routing, packets reach a destination in the same order they are delivered from a source node. Adaptive routing algorithms can greatly improve the performance by distributing packets over different routes. However, it requires a mechanism to reorder packets at destinations. Thereby, a large reordering buffer and a complex control mechanism are required at each node. This motivated us to propose a method guaranteeing in-order delivery while sending packets through alternative paths. The proposed method combines the advantages of both deterministic and adaptive routing algorithms. We introduce several routing algorithms working together in the network without creating cycles. By using these algorithms, packets of different flows use different routes while packets belonging to the same flow follow a single path. In this way, traffic is distributed over the network while addressing in-order delivery. We employ this approach on three-dimensional Networks-on-Chip.

  • 17.
    Ebrahimi, Masoumeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Department of Information Technology, University of Turku, Finland .
    Daneshtalab, Masoud
    Plosila, Juha
    Fault-tolerant routing algorithm for 3D NoC using hamiltonian path strategy2013In: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2013, 2013, p. 1601-1604Conference paper (Refereed)
    Abstract [en]

    While Networks-on-Chip (NoC) have been increasing in popularity with industry and academia, it is threatened by the decreasing reliability of aggressively scaled transistors. In this paper, we address the problem of faulty elements by the means of routing algorithms. Commonly, fault-tolerant algorithms are complex due to supporting different fault models while preventing deadlock. When moving from 2D to 3D network, the complexity increases significantly due to the possibility of creating cycles within and between layers. In this paper, we take advantages of the Hamiltonian path to tolerate faults in the network. The presented approach is not only very simple but also able to support almost all one-faulty unidirectional links in 2D and 3D NoCs.

  • 18.
    Ebrahimi, Masoumeh
    et al.
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics. University of Turku, Finland .
    Wang, J.
    Huang, L.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland .
    Jantsch, Axel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Rescuing healthy cores against disabled routers2014Conference paper (Refereed)
    Abstract [en]

    A router may be temporarily or permanently disabled in NoCs for several reasons such as saving power, occurring faults or testing. Disabling a router, however, may have a severe impact on the performance or functionality of the entire system if it results in disconnecting the core from the network. In this paper, we propose a deadlock-free routing algorithm which allows the core to stay connected to the system and continue its normal operation when its connected router is disabled. Our analysis and experiments show that the proposed technique has 100%, 93.60%, and 87.19% network availability by 100% packet delivery when 1, 2 and 3 routers are defunct or intentionally disabled. The algorithm provides adaptivity and it is lightweight, requiring one and two virtual channels along the X and Y dimension, respectively.

  • 19. Firuzan, A.
    et al.
    Modarressi, M.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Reconfigurable communication fabric for efficient implementation of neural networks2015In: 10th International Symposium on Reconfigurable and Communication-centric Systems-on-Chip, Institute of Electrical and Electronics Engineers (IEEE), 2015, article id 7238097Conference paper (Refereed)
    Abstract [en]

    Handling heavy multicast-based inter-neuron communication is the most challenging issue in parallel implementation of neural networks. To address this problem, a reconfigurable Network-on-Chip (NoC) architecture for neural networks is presented in this paper. The NoC consists of a number of node clusters with a fix topology connected by a reconfigurable inter-cluster communication fabric that efficiently handles multicast communication. The evaluation results show that the proposed architecture can better manage the multicast-based traffic of neural networks than the mesh-based topologies proposed in prior work. It offers up to 60% and 22% lower average message latency compared to a baseline and a state-of-the-Art NoC for neural networks, respectively, which directly translates to faster neural processing.

  • 20. Hojabr, Reza
    et al.
    Modarressi, Mehdi
    Daneshtalab, Masoud
    Yasoubi, Ali
    Khonsari, Ahmad
    Customizing Clos Network-on-Chip for Neural Networks2017In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 66, no 11, p. 1865-1877Article in journal (Refereed)
  • 21. Huang, L. -T
    et al.
    Dong, H.
    Wang, J. -S
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Li, G. -J
    WeNA: Deterministic Run-time Task Mapping for Performance Improvement in Many-core Embedded Systems2015In: IEEE Embedded Systems Letters, ISSN 1943-0663, Vol. 7, no 4, p. 93-96, article id 7097665Article in journal (Refereed)
    Abstract [en]

    Many-core embedded systems will feature an extremely dynamic workload distribution where massive applications arranged as an unpredictable sequence enter and leave the system at run-time. Efficient mapping strategy is required to allocate system resources to the incoming application. Noncontiguous mapping improves system throughput by utilizing disjoint nodes, however, the increasing communication distance and external congestion lead to high power consumption and network delay. This paper thus presents an enhanced noncontiguous dynamic mapping algorithm, aiming at decreasing interprocessor communication overhead and improving both network and application performance. Communication volumes are utilized to arrange the mapping order of tasks belong to the same application. Moreover, expanding parameter of each task is developed which directs the optimized mapping decision comparing to the current neighborhood and occupancy information. Experimental results show that our modified mapping algorithm Weighted-based Neighborhood Allocation (WeNA) makes considerable improvements on Average Weighted Manhattan Distance (8.06%) and network latency (9.8%) in comparison with the state-of-the-art algorithm.

  • 22. Huang, Letian
    et al.
    Wang, Junshi
    Ebrahimi, Masoumeh
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Zhang, Xiaofan
    Li, Guangjun
    Jantsch, Axel
    Non-Blocking Testing for Network-on-Chip2016In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 65, no 3, p. 679-692Article in journal (Refereed)
    Abstract [en]

    To achieve high reliability in on-chip networks, it is necessary to test the network as frequently as possible to detect physical failures before they lead to system-level failures. A main obstacle is that the circuit under test has to be isolated, resulting in network cuts and packet blockage which limit the testing frequency. To address this issue, we propose a comprehensive network-level approach which could test multiple routers simultaneously at high speed without blocking or dropping packets. We first introduce a reconfigurable router architecture allowing the cores to keep their connections with the network while the routers are under test. A deadlock-free and highly adaptive routing algorithm is proposed to support reconfigurations for testing. In addition, a testing sequence is defined to allow testing multiple routers to avoid dropping of packets. A procedure is proposed to control the behavior of the affected packets during the transition of a router from the normal to the testing mode and vice versa. This approach neither interrupts the execution of applications nor has a significant impact on the execution time. Experiments with the PARSEC benchmarks on an 8x8 NoC-based chip multiprocessors show only 3 percent execution time increase with four routers simultaneously under test.

  • 23.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT).
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Abbas, Naeem
    Serrano Leon, Guillermo
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    TransMap: Transformation Based Remapping and Parallelism for High Utilization and Energy Efficiency in CGRAs2016In: I.E.E.E. transactions on computers (Print), ISSN 0018-9340, E-ISSN 1557-9956, Vol. 65, no 11, p. 3456-3469Article in journal (Refereed)
    Abstract [en]

    In the era of platforms hosting multiple applications with arbitrary inter application communication and computation patterns, compile time mapping decisions are neither optimal nor desirable. As a solution to this problem, recently proposed architectures offer run-time remapping-. The run-time remapping techniques displace or parallelize/serialize an application to optimize different parameters (e.g., utilization and energy). To implement the dynamic remapping, reconfigurable architectures commonly store multiple (compile-time generated) implementations of an application. Each implementation represents a different platform location and/or degree of parallelism. The optimal implementation is selected at run-time. However, the compile-time binding either incurs excessive configuration memory overheads and/or is unable to map/parallelize an application even when sufficient resources are available. As a solution to this problem, we present Transformation based reMapping and parallelism (TransMap). TransMap stores only a single implementation and applies a series for transformations to the stored bitstream for remapping or parallelizing an application. Compared to state of the art, in addition to simple relocation in horizontal/vertical directions, TransMap also allows to rotate an application for mapping or parallelizing an application in resource constrained scenarios. By storing only a single implementation, TransMap offers significant reductions in configuration memory requirements (up to 73 percent for the tested applications), compared to state of the art compaction techniques. Simulation results reveal that the additional flexibility reduces the energy requirements by 33 percent and enhances the device utilization by 50 percent for the tested applications. Gate level analysis reveals that TransMap incurs negligible silicon (0.2 percent of the platform) and timing (6 additional cycles per application) penalty.

  • 24.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Abbas, N.
    Awan, M. A.
    Plosila, J.
    TEA: Timing and Energy Aware compression architecture for Efficient Configuration in CGRAs2015In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436Article in journal (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer time-multiplexing and dynamic applications parallelism to enhance device utilization and reduce energy consumption at the cost of additional memory (up to 50% area of the overall platform). To reduce the memory overheads, novel CGRAs employ either statistical compression, intermediate compact representation, or multicasting. Each compaction technique has different properties (i.e. compression ratio, decompression time and decompression energy) and is best suited for a particular class of applications. However, existing research only deals with these methods separately. Moreover, they only analyze the compaction ratio and do not evaluate the associated energy overheads. To tackle these issues, we propose a polymorphic compression architecture that interleaves these techniques in a unique platform. The proposed architecture allows each application to take advantage of a separate compression/decompression hierarchy (consisting of various types and implementations of hardware/software decoders) tailored to its needs. Simulation results, using different applications (FFT, Matrix multiplication, and WLAN), reveal that the choice of compression hierarchy has a significant impact on compression ratio (up to 52%), decompression energy (up to 4 orders of magnitude), and configuration time (from 33. n to 1.5. s) for the tested applications. Synthesis results reveal that introducing adaptivity incurs negligible additional overheads (1%) compared to the overall platform area.

  • 25.
    Jafri, Syed M. A. H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Tajammul, Adeel
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology.
    Ellervee, Peeter
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Morphable Compression Architecture for Efficient Configuration in CGRAs2014In: 2014 17th Euromicro Conference on Digital System Design (DSD), 2014, p. 42-49Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications. Novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance the power and silicon efficiency, they significantly increase the configuration memory overheads (up to 50% area of the overall platform). As a solution to this problem researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties (i.e. compression ratio and decoding time), and is therefore best suited for a particular class of applications (and situation). However, existing research only deals with these methods separately. In this paper we propose a morphable compression architecture that interleaves these techniques in a unique platform. The proposed architecture allows each application to enjoy a separate compression/decompression hierarchy (consisting of various types and implementations of hardware/software decoders) tailored to its needs. Thereby, our solution offers minimal memory while meeting the required configuration deadlines. Simulation results, using different applications (FFT, Matrix multiplication, and WLAN), reveal that the choice of compression hierarchy has a significant impact on compression ratio (from configware replication to 52%) and configuration cycles (from 33 nsec to 1.5 secs) for the tested applications. Synthesis results reveal that introducing adaptivity incurs negligible additional overheads (1%) compared to the overall platform area.

  • 26.
    Jafri, Syed M.A.H.
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Tajammul, Adeel
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Paul, Kolin
    Ellervee, Peeter
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Customizable Compression Architecture for Efficient Configuration in CGRAs2011In: Proceedings: 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014, 2011, p. 31-31Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications. Novel CGRAs allow each application to exploit runtime parallelism and time sharing. Although these features enhance the power and silicon efficiency, they significantly increase the configuration memory overheads. As a solution to this problem researchers have employed statistical compression, intermediate compact representation, and multicasting. Each of these techniques has different properties, and is therefore best suited for a particular class of applications. However, existing research only deals with these methods separately. In this paper we propose a morphable compression architecture that interleaves these techniques in a unique platform.

  • 27.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. Turku Centre for Computer Science, Finland; University of Turku, Finland.
    Gia, T.N.
    University of Turku, Finland.
    Dytckov, Sergei
    University of Turku, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Plosila, Juha
    Turku Centre for Computer Science, Finland; University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland.
    NeuroCGRA: A CGRA with support for neural networks2014In: Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, IEEE , 2014, p. 506-511Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) are becoming an increasingly popular implementation platform. In real world applications, the CGRAs are required to simultaneously host processing (e.g. Audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. For estimation problems, neural networks, promise a higher efficiency than conventional processing. However, most of the existing CGRAs provide no support for neural networks. To realize realize both neural networks and conventional processing on the same platform, this paper presents NeuroCGRA. NeuroCGRA allows the processing elements and the network to dynamically morph into either conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.

  • 28.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Leon, Guillermo Serrano
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Abbas, N.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    TransPar: Transformation based dynamic Parallelism for low power CGRAs2014In: Conference Digest - 24th International Conference on Field Programmable Logic and Applications, FPL 2014, 2014Conference paper (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern applications (e.g. 4G, CDMA, etc.). Recently proposed CGRAs offer runtime parallelism to reduce energy consumption (by lowering voltage/frequency). To implement the runtime parallelism, CGRAs commonly store multiple compile-time generated implementations of an application (with different degree of parallelism) and select the optimal version at runtime. However, the compile-time binding incurs excessive configuration memory overheads and/or is unable to parallelize an application even when sufficient resources are available. As a solution to this problem, we propose Transformation based dynamic Parallelism (TransPar). TransPar stores only a single implementation and applies a series for transformations to generate the bitstream for the parallel version. In addition, it also allows to displace and/or rotate an application to parallelize in resource constrained scenarios. By storing only a single implementation, TransPar offers significant reductions in configuration memory requirements (up to 73% for the tested applications), compared to state of the art compaction techniques. Simulation and synthesis results, using real applications, reveal that the additional flexibility allows up to 33% energy reduction compared to static memory based parallelism techniques. Gate level analysis reveals that TransPar incurs negligible silicon (0.2% of the platform) and timing (6 additional cycles per application) penalty.

  • 29.
    Jafri, Syed Mohammad Asad Hassan
    et al.
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Leon, Guillermo Serrano
    Iqbal, J.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    Paul, Kolin
    Indian Institute of Technology.
    Plosila, Juha
    University of Turku, Finland.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    RuRot: Run-time rotatable-expandable partitions for efficient mapping in CGRAs2014In: Proceedings - International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation, SAMOS 2014, 2014, p. 233-241Conference paper (Refereed)
    Abstract [en]

    Today, Coarse Grained Reconfigurable Architectures (CGRAs) host multiple applications, with arbitrary communication and computation patterns. Compile-time mapping decisions are neither optimal nor desirable to efficiently support the diverse and unpredictable application requirements. As a solution to this problem, recently proposed architectures offer run-time remapping. The run-time remappers displace or expand (parallelize/serialize) an application to optimize different parameters (such as platform utilization). However, the existing remappers support application displacement or expansion in either horizontal or vertical direction. Moreover, most of the works only address dynamic remapping in packet-switched networks and therefore are not applicable to the CGRAs that exploit circuitswitching for low-power and high predictability. To enhance the optimality of the run-time remappers, this paper presents a design framework called Run-time Rotatable-expandable Partitions (RuRot). RuRot provides architectural support to dynamically remap or expand (i.e. parallelize) the hosted applications in CGRAs with circuit-switched interconnects. Compared to state of the art, the proposed design supports application rotation (in clockwise and anticlockwise directions) and displacement (in horizontal and vertical directions), at run-time. Simulation results using a few applications reveal that the additional flexibility enhances the device utilization, significantly (on average 50 % for the tested applications). Synthesis results confirm that the proposed remapper has negligible silicon (0.2 % of the platform) and timing (2 cycles per application) overheads.

  • 30. Kokhazadeh, M.
    et al.
    Kokhazad, Z.
    Dehyadegari, M.
    Daneshtalab, Masoud
    KTH.
    A Novel Two-Step Method for Stereo Vision Algorithm to Reduce Search Space2018In: 26th Iranian Conference on Electrical Engineering, ICEE 2018, Institute of Electrical and Electronics Engineers Inc. , 2018, p. 1681-1686Conference paper (Refereed)
    Abstract [en]

    Stereo vision is a crucial algorithm in depth detection. By comparing images of a scene from two points, the relative position of objects is extracted. Human's vision system uses this relative shift between the left and right eyes to estimate the depth of information. The main goal of stereo vision is to determine the distance between objects in the scene or, in other words, to obtain depth information. This paper presents a two-step method to reduce the runtime and maintain accuracy of the stereo vision algorithm. Due to the data dependency, its implementation in parallel reduces performance. We have implemented this method for the different values of maximum disparity and window sizes. The simulation result shows that the proposed method is more than 6X faster than the common stereo vision. We have also implemented this method using Compute Unified Device Architecture (CUDA) on a Graphics Processing Unit (GPU), and we have shown that due to data dependency, this method does not work well on the Graphics Processing Unit.

  • 31.
    Kokhazadeh, M.
    et al.
    KN Toosi Univ Technol, Dept Comp Engn, Tehran 1631714191, Iran..
    Kokhazad, Z.
    Sharif Branch, Acad Ctr Educ Culture & Res, Dept IT, Tehran 141554364, Iran..
    Dehyadegari, M.
    KN Toosi Univ Technol, Dept Comp Engn, Tehran 1631714191, Iran..
    Daneshtalab, Masoud
    KTH, School of Electrical Engineering and Computer Science (EECS), Electronics, Electronic and embedded systems.
    A novel two-step method for stereo vision to reduce search space2018In: 26TH IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE 2018), IEEE , 2018, p. 1681-1686Conference paper (Refereed)
    Abstract [en]

    Stereo vision is a crucial algorithm in depth detection. By comparing images of a scene from two points, the relative position of objects is extracted. Human's vision system uses this relative shift between the left and right eyes to estimate the depth of information. The main goal of stereo vision is to determine the distance between objects in the scene or, in other words, to obtain depth information. This paper presents a two-step method to reduce the runtime and maintain accuracy of the stereo vision algorithm. Due to the data dependency, its implementation in parallel reduces performance. We have implemented this method for the different values of maximum disparity and window sizes. The simulation result shows that the proposed method is more than 6X faster than the common stereo vision. We have also implemented this method using Compute Unified Device Architecture (CUDA) on a Graphics Processing Unit (GPU), and we have shown that due to data dependency, this method does not work well on the Graphics Processing Unit.

  • 32. Kokhazadeh, M.
    et al.
    Kokhazad, Z.
    Dehyadegari, M.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics.
    Accelerating stereo vision algorithm using SSE3, AVX2, and CUDA2017In: 2017 25th Iranian Conference on Electrical Engineering, ICEE 2017, Institute of Electrical and Electronics Engineers (IEEE), 2017, p. 2194-2199, article id 7985426Conference paper (Refereed)
    Abstract [en]

    Stereo vision features a widespread usage such as robotics, unmanned cars, aerial surveys, and many real-time applications. Also, it needs computational expensive calculations because of stereo matching. In real time applications, the execution time of stereo vision depth detection algorithm is very important. This paper studies the Intel SIMD instructions and CUDA effects on reducing the execution time of the stereo vision. CUDA and SIMD instructions improve performance by exploiting data level parallelism. We present a fast implementation of SSD stereo vision algorithm on Intel processors using SIMD instruction sets (SSE3 and AVX2) and NVIDIA Graphics Processing Unit (GPU) using CUDA language and compare their results with serial implementation. The algorithm applied to different ranges of disparity (from 16 to 256), window size (from 3×3 to 15×15) and image resolution (from 256×212 to 1408×1168) parameters. We achieved 182 frames per second rate for the disparity of 64 and window size of 3×3 in CUDA, 64 frames per second rate in AVX2 and 25 frames per second rate in SSE3. Experimental results show that we can get speedup up to 5× in SSE3, 10× in AVX2 and 21× in CUDA compared to serial implementation.

  • 33. Maabi, S.
    et al.
    Safaei, F.
    Rezaei, A.
    Daneshtalab, Masoud
    KTH. Mälardalen University, Sweden.
    Zhao, D.
    ERFAN: Efficient reconfigurable fault-tolerant deflection routing algorithm for 3-D Network-on-Chip2017In: International System on Chip Conference, IEEE Computer Society , 2017, p. 306-311Conference paper (Refereed)
    Abstract [en]

    With degradation in transistors dimensions and complication of circuits, Three-Dimensional Network-on-Chip (3-D NoC) is presented as a promising solution in electronic industry. By increasing the number of system components on a chip, the probability of failure will increase. Therefore, proposing fault tolerance mechanisms is an important target in emerging technologies. In this paper, two efficient fault-tolerant routing algorithms for 3-D NoC are presented. The presented algorithms have significant improvement in performance parameters, in exchange for small area overhead. Simulation results show that even with the presence of faults, the network latency is decreased in comparison with state-of-the-art works. In addition, the network reliability is improved reasonably.

  • 34. Majd, A.
    et al.
    Sahebi, G.
    Daneshtalab, Masoud
    Mälardalen University, Sweden.
    Plosila, J.
    Tenhunen, H.
    Hierarchal Placement of Smart Mobile Access Points in Wireless Sensor Networks Using Fog Computing2017In: Proceedings - 2017 25th Euromicro International Conference on Parallel, Distributed and Network-Based Processing, PDP 2017, Institute of Electrical and Electronics Engineers Inc. , 2017, p. 176-180Conference paper (Refereed)
    Abstract [en]

    Recent advances in computing and sensor technologies have facilitated the emergence of increasingly sophisticated and complex cyber-physical systems and wireless sensor networks. Moreover, integration of cyber-physical systems and wireless sensor networks with other contemporary technologies, such as unmanned aerial vehicles (i.e. drones) and fog computing, enables the creation of completely new smart solutions. By building upon the concept of a Smart Mobile Access Point (SMAP), which is a key element for a smart network, we propose a novel hierarchical placement strategy for SMAPs to improve scalability of SMAP based monitoring systems. SMAPs predict communication behavior based on information collected from the network, and select the best approach to support the network at any given time. In order to improve the network performance, they can autonomously change their positions. Therefore, placement of SMAPs has an important role in such systems. Initial placement of SMAPs is an NP problem. We solve it using a parallel implementation of the genetic algorithm with an efficient evaluation phase. The adopted hierarchical placement approach is scalable, it enables construction of arbitrarily large SMAP based systems.

  • 35. Majd, A.
    et al.
    Troubitsyna, E.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics, Electronic and embedded systems.
    Safety-aware control of swarms of drones2017In: Computer Safety, Reliability, and Security: SAFECOMP 2017 Workshops, ASSURE, DECSoS, SASSUR, TELERISE, and TIPS, Trento, Italy, September 12, 2017, Proceedings, Springer, 2017, Vol. 10489, p. 249-260Conference paper (Refereed)
    Abstract [en]

    In this paper, we propose a novel approach to ensuring safety while planning and controlling an operation of swarms of drones. We derive the safety constraints that should be verified both during the mission planning and at the run-time and propose an approach to safety-aware mission planning using evolutionary algorithms. High performance of the proposed algorithm allows us to use it also at run-time to predict and resolve in a safe and optimal way dynamically emerging hazards. The benchmarking of the proposed approach validate its efficiency and safety.

  • 36. Majd, Amin
    et al.
    Abdollahi, Mandi
    Sahebi, Golnaz
    Abdollahi, Davoud
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Plosila, Juha
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Elektronics. Univ Turku, Finland.
    Multi-Population Parallel Imperialist Competitive Algorithm for Solving Systems of Nonlinear Equations2016In: 2016 INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING & SIMULATION (HPCS 2016), IEEE, 2016, p. 767-775Conference paper (Refereed)
    Abstract [en]

    the widespreadimportance of optimization and solving NP-hard problems, like solving systems of nonlinear equations, is indisputable in a diverse range of sciences. Vast uses of non-linear equations are undeniable. Some of their applications are in economics, engineering, chemistry, mechanics, medicine, and robotics. There are different types of methods of solving the systems of nonlinear equations. One of the most popular of them is Evolutionary Computing (EC). This paper presents an evolutionary algorithm that is called Parallel Imperialist Competitive Algorithm (PICA) which is based on a multi population technique for solving systems of nonlinear equations. In order to demonstrate the efficiency of the proposed approach, some well-known problems are utilized. The results indicate that the PICA has a high success and a quick convergence rate.

  • 37. Majd, Amin
    et al.
    Lotfi, Shahriar
    Sahebi, Golnaz
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Plosila, Juha
    PICA: Multi-Population Implementation of Parallel Imperialist Competitive Algorithms2016In: 2016 24TH EUROMICRO INTERNATIONAL CONFERENCE ON PARALLEL, DISTRIBUTED, AND NETWORK-BASED PROCESSING (PDP), Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 248-255Conference paper (Refereed)
    Abstract [en]

    The importance of optimization and NP problems solving cannot be over emphasized. The usefulness and popularity of evolutionary computing methods are also well established. There are various types of evolutionary methods that arc mostly sequential, and some others have parallel implementation. We propose a method to parallelize Imperialist Competitive Algorithm (Multi-Population). The algorithm has been implemented with MPI on two platforms and have tested our algorithms on a shared- memory and message passing architecture. An outstanding performance is obtained, which indicates that the method is efficient concern to speed and accuracy. In the second step, the proposed algorithm is compared with a set of existing well known parallel algorithms and is indicated that it obtains more accurate solutions in a lower time.

  • 38. Majd, Amin
    et al.
    Sahebi, Golnaz
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Univ. of Turku, Finland.
    Plosila, Juha
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronics. University of Turku, Finland.
    Placement of Smart Mobile Access Points in Wireless Sensor Networks and Cyber-Physical Systems using Fog Computing2016In: Proceedings - 13th IEEE International Conference on Ubiquitous Intelligence and Computing, 13th IEEE International Conference on Advanced and Trusted Computing, 16th IEEE International Conference on Scalable Computing and Communications, IEEE International Conference on Cloud and Big Data Computing, IEEE International Conference on Internet of People and IEEE Smart World Congress and Workshops, UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld 2016Proceedings - 13th IEEE International Conference on Ubiquitous Intelligence and Computing, 13th IEEE International Conference on Advanced and Trusted Computing, 16th IEEE International Conference on Scalable Computing and Communications, IEEE International Conference on Cloud and Big Data Computing, IEEE International Conference on Internet of People and IEEE Smart World Congress and Workshops, UIC-ATC-ScalCom-CBDCom-IoP-SmartWorld 2016, IEEE conference proceedings, 2016, p. 680-689Conference paper (Refereed)
    Abstract [en]

    Increasingly sophisticated, complex, and energy-efficient cyber-physical systems and wireless sensor networks are emerging, facilitated by recent advances in computing and sensor technologies. Integration of cyber-physical systems and wireless sensor networks with other contemporary technologies, such as unmanned aerial vehicles and fog or edge computing, enable creation of completely new smart solutions. We present the concept of a Smart Mobile Access Point (SMAP), which is a key building block for a smart network, and propose an efficient placement approach for such SMAPs. SMAPs predict the behavior of the network, based on information collected from the network, and select the best approach to support the network at any given time. When needed, they autonomously change their positions to obtain a better configuration from the network performance perspective. Therefore, placement of SMAPs is an important issue in such a system. Initial placement of SMAPs is an NP problem, and evolutionary algorithms provide an efficient means to solve it. Specifically, we present a parallel implementation of the imperialistic competitive algorithm and an efficient evaluation or fitness function to solve the initial placement of SMAPs in the fog computing context.

  • 39. Momenzadeh, E.
    et al.
    Modarressi, M.
    Mazloumi, A.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics, Electronic and embedded systems. Mälardalen University (MDH), Sweden.
    Parallel forwarding for efficient bandwidth utilization in networks-on-chip2017In: 30th International Conference on Architecture of Computing Systems, ARCS 2017, Springer Verlag , 2017, p. 152-163Conference paper (Refereed)
    Abstract [en]

    Networks-on-chip (NoC) provide a scalable and power-efficient communication infrastructure for different computing chips, ranging from fully customized multi/many-processor systems-on-chip (MPSoCs) to general-purpose chip multiprocessors (CMPs). A common aspect in almost all NoC workloads is the varying size of data transmitted by each transaction: while large data blocks are transferred as multiple-flit packets, a part of the traffic consists of short data segment (control data) that does not even fill a single flit. In conventional NoCs, switch allocator assigns/ grants a switch output (and the link connected to it) to a single flit at each cycle, even if the flit is shorter than the link bit-width. In this paper, we propose a novel NoC architecture that enables routers to simultaneously send two short flits on the same link, effectively utilizing the link bandwidth that otherwise would be wasted. To this end, new crossbar, virtual channel (VC), and switch allocator architectures are presented to support parallel short packet forwarding on NoC links. Simulation results using synthetic and realistic workloads show that the proposed architecture improves the NoC performance by up to 24%.

  • 40. Ngyen, T.
    et al.
    Jafri, Syed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Turku Centre for Computer Science, Finland.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland .
    Hemani, Ahmed
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Dytckov, Sergei
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Plosila, Juha
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland .
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku, Finland.
    FIST: A framework to interleave spiking neural networks on CGRAs2015In: Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015, IEEE , 2015, p. 751-758Conference paper (Refereed)
    Abstract [en]

    Coarse Grained Reconfigurable Architectures (CGRAs) are emerging as enabling platforms to meet the high performance demanded by modern embedded applications. In many application domains (e.g. robotics and cognitive embedded systems), the CGRAs are required to simultaneously host processing (e.g. Audio/video acquisition) and estimation (e.g. audio/video/image recognition) tasks. Recent works have revealed that the efficiency and scalability of the estimation algorithms can be significantly improved by using neural networks. However, existing CGRAs commonly employ homogeneous processing resources for both the tasks. To realize the best of both the worlds (conventional processing and neural networks), we present FIST. FIST allows the processing elements and the network to dynamically morph into either conventional CGRA or a neural network, depending on the hosted application. We have chosen the DRRA as a vehicle to study the feasibility and overheads of our approach. Synthesis results reveal that the proposed enhancements incur negligible overheads (4.4% area and 9.1% power) compared to the original DRRA cell.

  • 41. Palesi, M.
    et al.
    Collotta, M.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Bose, P.
    Special issue on energy efficient methods and systems in the emerging cloud era2016In: Journal of computer and system sciences (Print), ISSN 0022-0000, E-ISSN 1090-2724, Vol. 82, no 2, p. 173-173Article in journal (Refereed)
  • 42. Palesi, M.
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Univ. of Turku, Finland.
    Wang, X.
    Ebrahimi, M.
    Patti, D.
    Message from the chairs2016In: 9th International Workshop on Network on Chip Architectures, NoCArc 2016, Association for Computing Machinery (ACM), 2016, Vol. 15-October-2016Conference paper (Refereed)
  • 43. Palesi, M.
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. Univ. of Turku, Finland.
    Wang, X.
    Mehdipour, F.
    Dimitrakopoulos, G.
    Message from the chairs2014Conference paper (Refereed)
  • 44. Palesi, Maurizio
    et al.
    Daneshtalab, Masoud.
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Wang, X.
    Ebrahimi, M.
    Locatelli, R.
    Message from the Chairs2015In: Proceedings of the 8th International Workshop on Network on Chip Architectures, Association for Computing Machinery , 2015, Vol. 05-December-2015Conference paper (Refereed)
  • 45. Phong, N. D. B.
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland .
    Dytckov, S.
    Plosila, J.
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems. University of Turku, Finland .
    Silicon synapse designs for VLSI neuromorphic platform2014In: NORCHIP 2014 - 32nd NORCHIP Conference: The Nordic Microelectronics Event, IEEE , 2014, p. 7004745-Conference paper (Refereed)
    Abstract [en]

    Analog silicon neurons were proven to be a promising solution for VLSI neuromorphic platform to implement massively scalable computing systems. They possess the advantages of consuming less power and silicon area than digitally designed neurons. This paper compares the differences in power and area consumption between two methods of synapse design for analog neuron models: time-based modulation and current-based modulation. The obtained results demonstrate that under the same technology process (ST CMOS 65nm), the neuron that uses time-based modulation consumes less power (almost six times) and silicon area (about thirty times) but higher energy (twelve times) than that of the current-based modulation.

  • 46. Rezaei, A.
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Safaei, F.
    Zhao, D.
    Hierarchical approach for hybrid wireless Network-on-chip in many-core era2016In: Computers & electrical engineering, ISSN 0045-7906, E-ISSN 1879-0755, Vol. 51, p. 225-234Article in journal (Refereed)
    Abstract [en]

    Due to high latency and high power consumption in long hops between operational cores of Network-on-Chips (NoCs), the performance of such architectures has been limited. Billions of transistors available on a single chip present opportunities for new levels of computing capability. In order to fill the gap between computing requirements and efficient communications, a new technology called Wireless NoC has been emerged. Employing wireless communication links between cores, wireless NoC has reasonably increased the performance of NoC. However, wireless transceivers along with associated antenna impose considerable area and power overheads in wireless NoCs. Thus, in this paper, we introduce a hybrid wireless NoC called Hierarchical Wireless-based Architecture (HiWA) to use the wireless resources optimally. In the proposed approach the network is divided into subnets where intra-subnet nodes communicate through wire links while inter-subnet communications are handled almost by single-hop wireless links. Simulation results show that HiWA efficiently reduces power consumption by 39% in comparison with a traditional wireless NoC, called WiNoC, while still achieves 16% lower packet latency than conventional NoC.

  • 47. Rezaei, A.
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics, Electronic and embedded systems.
    Zhao, D.
    CAP-W: Congestion-aware platform for wireless-based network-on-chip in many-core era2017In: Microprocessors and microsystems, ISSN 0141-9331, E-ISSN 1872-9436, Vol. 52, p. 23-33Article in journal (Refereed)
    Abstract [en]

    In order to fulfill the ever-increasing demand for high-speed and high-bandwidth, wireless-based MCSoC is presented based on a NoC communication infrastructure. Inspiring the separation between the communication and the computation demands as well as providing the flexible topology configurations, makes wireless-based NoC a promising future MCSoC architecture. However, congestion occurrence in wireless routers reduces the benefit of high-speed wireless links and significantly increases the network latency. Therefore, in this paper, a congestion-aware platform, named CAP-W, is introduced for wireless-based NoC in order to reduce congestion in the network and especially over wireless routers. The triple-layer platform of CAP-W is composed of mapping, migration, and routing layers. In order to minimize the congestion probability, the mapping layer is responsible for selecting the suitable free core as the first candidate, finding the suitable first task to be mapped onto the selected core, and allocating other tasks with respect to contiguity. Considering dynamic variation of application behaviors, the migration layer modifies the primary task mapping to improve congestion situation. Furthermore, the routing layer balances utilization of wired and wireless networks by separating short-distance and long-distance communications. Experimental results show meaningful gain in congestion control of wireless-based NoC compared to state-of-the-art works.

  • 48. Rezaei, A.
    et al.
    Daneshtalab, Masoud
    KTH. Mälardalen University, Sweden.
    Zhao, D.
    Modarressi, M.
    SAMi: Self-aware migration approach for congestion reduction in NoC-based MCSoC2017In: International System on Chip Conference, IEEE Computer Society , 2017, p. 145-150Conference paper (Refereed)
    Abstract [en]

    Many-Core System-on-Chips (MCSoCs) require efficient task migration approach in order to reach system performance objectives such as load balancing, communication optimization, fault tolerance, and temperature control. In this paper an efficient self-aware migration approach is introduced for NoC-based MCSoCs using a centralized feedback controller in order to control the congestion over the system. The proposed approach is divided into four main steps: predicting behavior of the application, defining reliable triggers to initiate task migration, introducing cost comparison functions, and presenting a streamlined controlling mechanism to migrate tasks. The experimental results affirm that the proposed self-aware migration approach can help achieving significant throughput and system utilization while efficiently controlling system congestion.

  • 49. Rezaei, A.
    et al.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems.
    Zhao, D.
    Safaei, F.
    Wang, X.
    Ebrahimi, Masoumeh
    KTH, School of Information and Communication Technology (ICT), Industrial and Medical Electronics.
    Dynamic application mapping algorithm for wireless network-on-chip2015In: Proceedings - 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP 2015, IEEE Press, 2015, p. 421-424Conference paper (Refereed)
    Abstract [en]

    Because of high bandwidth, low latency and flexible topology configurations provided by wireless NoC, this emerging technology is gaining momentum to be a promising future on-chip interconnection paradigm. However, congestion occurrence in wireless routers reduces the benefit of high speed wireless links and significantly increases the network latency; therefore, in this paper, a Dynamic Application Mapping Algorithm (DAMA) is introduced for wireless NoCs in order to reduce both internal and external congestion. DAMA has three key steps: finding the first node to map, choosing the first task to be mapped onto the first node, and allocation of the remaining tasks to the remaining nodes. Simulation results show significant gain in the mapping cost functions compared to state-of-the-art works.

  • 50. Rezaei, A.
    et al.
    Safaei, F.
    Daneshtalab, Masoud
    KTH, School of Information and Communication Technology (ICT), Electronics and Embedded Systems. University of Turku (UTU), Finland .
    Tenhunen, Hannu
    KTH, School of Information and Communication Technology (ICT), Electronic Systems.
    HiWA: A hierarchical Wireless Network-on-Chip architecture2014In: Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, 2014, p. 499-505Conference paper (Refereed)
    Abstract [en]

    Due to high latency and high power consumption in long hops between operational cores of NoCs, the performance of such architectures has been limited. In order to fill the gap between computing requirements and efficient communications, a new technology called Wireless Network-on-Chip (WNoC) has been emerged. Employing wireless communication links between cores, the new technology has reasonably increased the performance of NoC. However, wireless transceivers along with associated antenna impose considerable area and power overheads in WNoCs. Thus, in this paper, we introduce a hierarchical WNoC called Hierarchical Wireless-based Architecture (HiWA) to use the wireless resources optimally. In the proposed approach the network is divided into subnets where intra-subnet nodes communicate through wire links while inter-subnet communications are almost handled by single-hop wireless links. On top of that, we have also defined performance evaluation parameters. Simulation results show that the proposed architecture reduces average packet latency 16% and power consumption 14% in comparison with its conventional counterparts.

12 1 - 50 of 63
CiteExportLink to result list
Permanent link
Cite
Citation style
  • apa
  • harvard1
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf