High-Level Synthesis for SiLago: Advances in Optimization of High-Level Synthesis Tool and Neural Network Algorithms
KTH, School of Electrical Engineering and Computer Science (EECS), Electrical Engineering, Electronics and Embedded systems, Electronic and embedded systems.
ORCID iD: 0000-0003-2396-3590
2022 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Embedded hardware design and its automation aim to improve both energy efficiency and engineering efficiency. However, these two goals are often contradictory: attempts to improve energy efficiency often come at the cost of engineering efficiency, and vice versa. High-level synthesis (HLS) is a good example of this challenge. Although it has been researched for more than three decades, it has not become a mainstream design flow component for custom hardware synthesis because of the large efficiency gap between HLS-generated hardware designs and manual RTL designs.

This thesis attempts to address the HLS challenge. We divide the research challenge of improving state-of-the-art HLS into three components: 1) the hardware architecture and its underlying VLSI design style, 2) the design automation algorithms and data structures, and 3) the optimization of the algorithm to be mapped.

The SiLago hardware platform has been reported in the literature as a prominent hardware architecture that can deliver ASIC-like efficiency and could be an ideal HLS hardware platform. It has the following features: 1) SiLago embodies parallel, distributed two-level control; 2) SiLago blocks are hardened blocks from which valid VLSI designs can be created by abutment, without involving logic or physical synthesis.

Consequently, when targeting the SiLago hardware platform, the SiLago HLS tool generates not a single controller but multiple collaborating controllers, each of which is a two-level hierarchy. The distributed two-level control scheme poses unique challenges for synchronization and scheduling. Unique data structures and instruction scheduling models are developed for the SiLago HLS tool to support this scheme. The SiLago HLS tool also generates a valid GDSII macro whose average energy, area, and performance are not estimated but known with post-layout accuracy, thanks to the predictable SiLago hardware blocks. Moreover, the SiLago HLS tool is not intended for the end user. It is designed to develop a library of algorithm implementations used by the application-level synthesis (ALS) tool in the SiLago framework, where an application is defined as a hierarchy of algorithms. This library would include algorithms that vary in function, dimension, and degree of parallelism. The ALS tool explores the design space in terms of the number and type of algorithm implementations, rather than in terms of arithmetic resources as conventional HLS tools do.
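As a concrete illustration of this kind of exploration, the minimal Python sketch below picks one implementation per algorithm from a pre-characterized library under area and latency budgets and keeps the lowest-energy feasible combination. The library entries, cost figures, and budgets are invented for illustration; they are not SiLago data or the ALS tool's actual algorithm.

```python
# Hypothetical sketch: application-level design-space exploration over
# algorithm implementations (not arithmetic resources). All names and
# numbers below are invented placeholders.
from itertools import product

# implementation library: algorithm -> [(name, area, energy, latency), ...]
LIBRARY = {
    "fft":    [("fft_seq", 1.0, 3.0, 900), ("fft_par4", 3.5, 2.2, 260)],
    "fir":    [("fir_seq", 0.8, 1.5, 400), ("fir_par2", 1.5, 1.2, 210)],
    "argmax": [("argmax_seq", 0.3, 0.4, 120)],
}
AREA_BUDGET = 5.5      # total area allowed for the application
LATENCY_BUDGET = 700   # deadline, assuming the algorithms run back to back

best = None
# enumerate one implementation per algorithm in the application hierarchy
for combo in product(*LIBRARY.values()):
    area = sum(impl[1] for impl in combo)
    energy = sum(impl[2] for impl in combo)
    latency = sum(impl[3] for impl in combo)
    if area <= AREA_BUDGET and latency <= LATENCY_BUDGET:
        if best is None or energy < best[1]:
            best = (tuple(impl[0] for impl in combo), energy)

print(best)   # lowest-energy selection that meets both budgets
```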

Algorithms are often developed by domain experts. For efficient implementation in hardware, such algorithms often need to be optimized with the hardware platform in mind. Two algorithm instances have been chosen for demonstration purposes. The first is a self-organizing map (SOM) based genome recognition algorithm. The second is a complex model of the cortex called the Bayesian confidence propagation neural network (BCPNN); as developed by computational neuroscientists, the original model demands too much memory storage and memory access.

This thesis addresses the latter two components, because the first has already been addressed in the literature. We first demonstrate the design of the SiLago HLS tool to support hardware features such as the distributed two-level control system. Moreover, we use the two complex algorithm instances, SOM and BCPNN, to demonstrate both general-purpose and algorithm-specific hardware-oriented optimization techniques. With the research carried out in this thesis, the SiLago HLS framework is greatly improved.

Abstract [sv]

The automation of embedded systems increases engineering productivity and reduces the systems' energy consumption. These goals are often contradictory, since higher engineering productivity comes at the cost of energy consumption and vice versa. High-level synthesis (HLS) exemplifies this dilemma. Although HLS has been researched for more than three decades, the design methodology has not become mainstream in electronic design because of the large efficiency gap between HLS-generated and manually created RTL designs.

This thesis addresses that dilemma. We divide the research challenge of improving HLS into three parts: 1) the hardware architecture and its underlying VLSI design methodology, 2) the design automation algorithms and data structures, and 3) the optimization of the algorithm to be implemented.

The SiLago platform has been shown to be a prominent hardware architecture that achieves ASIC-like efficiency while being an ideal target for HLS. The platform has the following properties: 1) SiLago embodies the parallel, distributed two-level control paradigm, and 2) SiLago blocks are pre-synthesized components from which functional VLSI designs can be created by abutment, without further logic or physical synthesis.

Consequently, the SiLago HLS tool does not generate a single controller but several collaborating controllers, each consisting of a two-level hierarchy. This poses unique synchronization and scheduling challenges. Unique data structures and scheduling models have been developed for the SiLago HLS tool to support this two-level control paradigm. Furthermore, the SiLago HLS tool generates GDSII macros whose average energy consumption, area, and performance are not estimated but determined with post-layout accuracy, thanks to the pre-synthesized SiLago blocks. The target audience of the SiLago HLS tool is not end users but developers who build algorithm libraries that are then used by SiLago's application-level synthesis (ALS) tool. The application is viewed as a hierarchy of algorithms. The library can contain algorithms whose properties differ, such as function, dimension, and degree of parallelism. The ALS tool explores the design space in terms of the number and types of algorithms, rather than arithmetic resources as competing HLS tools do.

Algorithms are often developed by domain experts. To be realized efficiently in hardware, they must be optimized with the target platform in mind. Two algorithms have been chosen as demonstration examples. The first is a genome identification algorithm based on a self-organizing map (SOM). The second is an advanced model of the cortex, the Bayesian confidence propagation neural network (BCPNN). As developed by computational neuroscientists, the model requires too much storage and memory bandwidth.

This thesis addresses the latter two parts, since the first has already been treated by others. We show how the SiLago HLS tool supports the distributed two-level control system. Furthermore, using the two algorithm examples, SOM and BCPNN, we demonstrate both algorithm-specific and platform-oriented optimization techniques. The research described in this thesis has significantly improved the SiLago HLS framework.

Place, publisher, year, edition, pages
Sweden: KTH Royal Institute of Technology, 2022, p. 76
Series
TRITA-EECS-AVL ; 2022:48
Keywords [en]
Electronic Design Automation (EDA), Computer Aided Design (CAD), Algorithm-level Synthesis, SiLago, Optimization Techniques, Neural Network
National Category
Embedded Systems
Research subject
Information and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-317555
ISBN: 978-91-8040-300-9 (print)
OAI: oai:DiVA.org:kth-317555
DiVA, id: diva2:1695389
Public defence
2022-10-06, Ka-Sal C, Electrum, Kungliga Tekniska Högskolan, Kistagången 16, Stockholm, 13:00 (English)
Note

QC 20220914

Available from: 2022-09-14. Created: 2022-09-13. Last updated: 2022-09-14. Bibliographically approved.
List of papers
1. Vesyla-II: An Algorithm Library Development Tool for Synchoros VLSI Design Style
(English) Manuscript (preprint) (Other academic)
Abstract [en]

High-level synthesis (HLS) has been researched for decades and is still limited to fast FPGA prototyping and algorithmic RTL generation. A feasible end-to-end system-level synthesis solution has never been rigorously demonstrated. Modularity and composability are the keys to enabling a system-level synthesis framework that bridges the huge gap between system-level specification and physical-level design. This implies that 1) modules at each abstraction level should be physically composable without any irregular glue logic and 2) the cost of each module at each abstraction level is accurately predictable. The ultimate reasons that limit how far conventional HLS can go are precisely that it can neither generate modular designs that are physically composable nor accurately predict the cost of its designs. In this paper, we propose Vesyla, not as yet another HLS tool, but as a synthesis tool that positions itself in a promising end-to-end synthesis framework while preserving the ability to generate physically composable modular designs and to accurately predict their cost metrics. We present how Vesyla is constructed, focusing on the novel platform it targets and the internal data structures that highlight its uniqueness. We also show how Vesyla is positioned in the end-to-end synchoros synthesis framework called SiLago.
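As a minimal illustration of the composability argument, the sketch below assumes, as the abstract states, that modules compose by abutment with no glue logic and that each module's post-layout cost is known; the cost of a composed design is then an aggregation of library numbers, so a synthesis tool can report it exactly rather than estimate it. The block names and figures are invented placeholders, not Vesyla's library.

```python
# Toy model of cost prediction by composition of hardened blocks.
# All block names and numbers are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    name: str
    area_um2: float          # post-layout area of the hardened block
    energy_pj_per_op: float  # post-layout energy per operation
    cycle_time_ns: float     # post-layout critical path

LIBRARY = {
    "dpu":      Block("dpu", 12000.0, 4.1, 1.2),
    "reg_file": Block("reg_file", 9000.0, 2.3, 1.1),
    "seq_ctrl": Block("seq_ctrl", 3000.0, 0.7, 1.0),
}

def compose(block_names):
    """Cost of a design built by abutting hardened blocks: area and energy
    add up, and the clock is limited by the slowest block, because no new
    logic or physical synthesis is performed."""
    blocks = [LIBRARY[n] for n in block_names]
    return {
        "area_um2": sum(b.area_um2 for b in blocks),
        "energy_pj_per_op": sum(b.energy_pj_per_op for b in blocks),
        "cycle_time_ns": max(b.cycle_time_ns for b in blocks),
    }

print(compose(["seq_ctrl", "reg_file", "dpu", "dpu"]))
```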

Keywords
Synchoros VLSI Design, High-level Synthesis, CGRA, Design Space Exploration, Two-level Control
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-317476 (URN)
Note

QC 20220914

Available from: 2022-09-12. Created: 2022-09-12. Last updated: 2022-09-14. Bibliographically approved.
2. Scheduling Persistent and Fully Cooperative Instructions
2021 (English) In: 2021 24th Euromicro Conference on Digital System Design (DSD 2021) / [ed] Leporati, F., Vitabile, S., Skavhaug, A., Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 229-237. Conference paper, Published paper (Refereed)
Abstract [en]

A parallel, distributed two-level control system has been adopted in streaming-application accelerators that implement atomic vector operations. Each instruction of such an architecture deals with one aspect (arithmetic, interconnect, storage, etc.) of an atomic vector operation. Such instructions are persistent and fully cooperative. Their lifetimes vary with the vector size and the degree of parallelism, and more complex constraints are required to express the cooperation among them. Conventional instruction behavior models are no longer suitable for such instructions. Therefore, we develop a novel instruction behavior model to address the scheduling of the instruction set required by such an architecture. Based on the behavior model, we formally define the scheduling problem and formulate it as a constraint satisfaction optimization problem (CSOP). However, the naive CSOP formulation quickly becomes unscalable, so a heuristic-enhanced scheduling algorithm is introduced to make the CSOP approach scalable. The enhanced algorithm's scalability is validated by a large set of experiments of varying problem size.
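The following is a minimal plain-Python sketch of what such a CSOP formulation looks like: start cycles are the decision variables, precedence and a single-issue channel are the constraints, and the makespan is minimized by exhaustive search. The instruction names and lifetimes are invented, and the brute-force search is exactly the kind of naive formulation that the paper's heuristic is meant to make scalable.

```python
# Toy CSOP for scheduling persistent instructions (hypothetical instruction
# set; not the paper's model or solver).
from itertools import product

# (name, lifetime_in_cycles, predecessors that must finish first)
INSTRS = [
    ("route",   2, []),          # configure interconnect
    ("dpu",     4, ["route"]),   # arithmetic over a vector
    ("refi_rd", 4, ["route"]),   # streaming read from the register file
    ("refi_wr", 4, ["dpu"]),     # streaming write of the result
]
HORIZON = 12  # upper bound on the schedule length

def feasible(start):
    # single-issue channel: at most one instruction issued per cycle
    if len(set(start.values())) != len(start):
        return False
    # precedence: a successor may start only after its predecessors finish
    for name, _, preds in INSTRS:
        for p in preds:
            p_dur = next(d for n, d, _ in INSTRS if n == p)
            if start[name] < start[p] + p_dur:
                return False
    return True

def makespan(start):
    return max(start[n] + d for n, d, _ in INSTRS)

best = None
# naive exhaustive search over all start-cycle assignments
for starts in product(range(HORIZON), repeat=len(INSTRS)):
    start = {name: t for (name, _, _), t in zip(INSTRS, starts)}
    if feasible(start) and (best is None or makespan(start) < makespan(best)):
        best = start

print(best, makespan(best))
```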

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
Instruction scheduling, CGRA, Two-level control, Constraint programming
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-307013 (URN)
10.1109/DSD53832.2021.00044 (DOI)
000728394500035
2-s2.0-85125768770 (Scopus ID)
Conference
24th Euromicro Conference on Digital System Design (DSD), September 1-3, 2021, Palermo, Italy
Note

Part of proceedings: ISBN 978-1-6654-2703-6

Not a duplicate of DiVA 1588102, which has the same title but refers to a different conference.

QC 20220112

Available from: 2022-01-12. Created: 2022-01-12. Last updated: 2022-09-13. Bibliographically approved.
3. Reducing the Configuration Overhead of the Distributed Two-level Control System
2022 (English) In: Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE 2022), IEEE, 2022, p. 104-107. Conference paper, Published paper (Refereed)
Abstract [en]

With the growing demand for more efficient hardware accelerators for streaming applications, a novel Coarse-Grained Reconfigurable Architecture (CGRA) that uses a Distributed Two-Level Control (D2LC) system has been proposed in the literature. Even though its highly distributed and parallel structure makes it fast and energy-efficient, the single-issue instruction channel between the level-1 and level-2 controllers in each D2LC cell becomes a performance bottleneck. In this paper, we improve the design to mimic a multi-issue architecture by inserting shadow instruction buffers between the level-1 and level-2 controllers. Together with a zero-overhead hardware loop, the improved D2LC architecture enables efficient overlap between loop iterations. We also propose a complete constraint-programming-based instruction scheduling algorithm to support these hardware features. The experimental results show that the improved D2LC architecture achieves up to a 25% reduction in instruction execution cycles and a 35% reduction in energy-delay product.
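A toy cycle-count model of the idea is sketched below; the cycle figures are assumptions for illustration, not the paper's measurements.

```python
# Toy model (assumed numbers) of why shadow instruction buffers plus a
# zero-overhead hardware loop reduce execution cycles: configuration of the
# next iteration is hidden behind execution of the current one instead of
# being serialized on the single-issue channel.

ITERATIONS = 64
ISSUE_CYCLES = 6     # cycles to push one iteration's instructions
EXEC_CYCLES = 20     # cycles to execute one iteration's vector operations
BRANCH_CYCLES = 2    # per-iteration loop-control overhead without a hardware loop

# Baseline: issue, execute, and loop control are serialized every iteration.
baseline = ITERATIONS * (ISSUE_CYCLES + EXEC_CYCLES + BRANCH_CYCLES)

# With a shadow buffer, the next iteration's instructions are loaded while the
# current one executes, and the zero-overhead loop removes the branch cost;
# only the first issue remains exposed.
overlapped = ISSUE_CYCLES + ITERATIONS * max(EXEC_CYCLES, ISSUE_CYCLES)

print(f"baseline:   {baseline} cycles")
print(f"overlapped: {overlapped} cycles")
print(f"reduction:  {100 * (1 - overlapped / baseline):.1f}%")
```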

Place, publisher, year, edition, pages
IEEE, 2022
Series
Design Automation and Test in Europe Conference and Exhibition, ISSN 1530-1591
Keywords
Loop acceleration, Instruction scheduling, CGRA, Two-level control, Constraint programming
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-311689 (URN)
10.23919/date54114.2022.9774741 (DOI)
000819484300024
2-s2.0-85130840236 (Scopus ID)
Conference
25th Design, Automation and Test in Europe Conference and Exhibition (DATE), March 14-23, 2022
Note

Part of proceedings: ISBN 978-3-9819263-6-1

QC 20220503

QC 20220121

Available from: 2022-05-02. Created: 2022-05-02. Last updated: 2023-02-21. Bibliographically approved.
4. RiBoSOM: Rapid bacterial genome identification using self-organizing map implemented on the synchoros SiLago platform
2018 (English) In: ACM International Conference Proceeding Series, Association for Computing Machinery (ACM), 2018, p. 105-114. Conference paper, Published paper (Refereed)
Abstract [en]

Artificial neural networks (ANNs) have been applied to many traditional machine learning applications in image and speech processing. More recently, ANNs have caught the attention of the bioinformatics community for their ability not only to speed up analysis by avoiding genome assembly but also to work with imperfect data sets containing duplications. ANNs for bioinformatics also have the added attraction of scaling better for massive parallelism than traditional bioinformatics algorithms. In this paper, we adapt self-organizing maps (SOMs) for rapid identification of bacterial genomes; the result is called BioSOM. BioSOM has been implemented on a design composed of two coarse-grained reconfigurable fabrics, customized for dense linear algebra and streaming scratchpad memory, respectively. These fabrics are implemented in a novel synchoros VLSI design style that enables composition by abutment. The synchoricity enables rapid and accurate synthesis from Matlab models to create solutions with near-ASIC efficiency. This platform, called SiLago (Silicon Lego), is benchmarked against a GPU implementation. SiLago implementations of BioSOM in four dimensions, 128, 256, 512, and 1024 neurons, were trained on two E. coli strains with 40K training vectors. The results were benchmarked against a GTX 1070 GPU implementation in the CUDA framework. The comparison reveals a 4x to 140x speedup and a 4 to 5 orders of magnitude improvement in energy-delay product compared to the GPU implementation. This extreme efficiency comes with the added benefit of automated generation of GDSII-level designs from Matlab using the synchoros VLSI design style.
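For readers unfamiliar with SOMs, the NumPy sketch below shows the training kernel that BioSOM accelerates: a best-matching-unit search followed by a neighbourhood update. The map size, learning rates, and random inputs are invented, and the paper's genome-encoding pipeline is not reproduced.

```python
# Minimal SOM training sketch (illustrative parameters, random input data).
import numpy as np

rng = np.random.default_rng(0)
GRID = 16                    # 16x16 = 256 neurons
DIM = 64                     # dimension of each (e.g., k-mer derived) vector
weights = rng.random((GRID, GRID, DIM))
coords = np.stack(np.meshgrid(np.arange(GRID), np.arange(GRID), indexing="ij"), axis=-1)

def train(data, epochs=5, lr0=0.5, sigma0=GRID / 2):
    global weights
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:
            # best matching unit = neuron with the closest weight vector
            d2 = ((weights - x) ** 2).sum(axis=-1)
            bmu = np.unravel_index(np.argmin(d2), d2.shape)
            # learning rate and neighbourhood radius shrink over time
            frac = t / steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 1e-3
            # pull the BMU and its neighbours toward the input vector
            dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-dist2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
            t += 1

train(rng.random((1000, DIM)))
```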

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2018
Series
ACM International Conference Proceeding Series
Keywords
Neural networks, Self-Organizing Maps, SiLago, Synchoros VLSI Design, Parallel architecture, 3D DRAM, GPU
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-247206 (URN)
10.1145/3229631.3229650 (DOI)
000475843000013
2-s2.0-85060986330 (Scopus ID)
9781450364942 (ISBN)
Conference
18th Annual International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS 2018), 15-19 July 2018
Note

QC 20190416

Available from: 2019-04-16. Created: 2019-04-16. Last updated: 2024-07-23. Bibliographically approved.
5. Approximate Computing Applied to Bacterial Genome Identification using Self-Organizing Maps
2019 (English) In: 2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), IEEE Computer Society, 2019, p. 560-567, article id 8839522. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper we explore the design space of a self-organizing map (SOM) used for rapid and accurate identification of bacterial genomes. This is an important health care problem because, even in Europe, 70% of antibiotic prescriptions are wrong. The SOM is trained on Next Generation Sequencing (NGS) data and is able to identify the exact strain of bacteria, in contrast to conventional methods that require genome assembly to identify the bacterial strain. The SOM has been implemented as a synchoros VLSI design and shown to have 3-4 orders of magnitude better computational efficiency compared to GPUs. To further lower the energy consumption, we exploit the robustness of the SOM by successively lowering the arithmetic resolution, gaining further improvements in efficiency and implementation cost without substantially sacrificing accuracy. We perform an in-depth analysis of the reduction in resolution versus the loss in accuracy as the basis for designing a system with the lowest cost and acceptable accuracy, using NGS data from samples containing multiple bacteria provided by the lab of one of the co-authors. The objective is to design a bacterial recognition system for battery-operated clinical use, where area, power, and performance are of critical importance. We demonstrate that a 12-bit representation (with a 39% loss in accuracy) and a 16-bit representation (with a 1% loss) can yield significant savings in energy and area.
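A hedged sketch of this resolution-versus-accuracy study is shown below: SOM weights and query vectors are quantized to fixed point at different bit widths, and the fraction of best-matching-unit decisions that change relative to full precision is measured. The data are random placeholders, not the NGS data used in the paper.

```python
# Sketch of a bit-width sweep for SOM classification (random placeholder data).
import numpy as np

rng = np.random.default_rng(1)
N_NEURONS, DIM, N_QUERIES = 256, 64, 200
weights = rng.random((N_NEURONS, DIM))
queries = rng.random((N_QUERIES, DIM))

def quantize(x, bits):
    # uniform quantization to a fixed-point grid with `bits - 1` fractional
    # bits, which suits values in [0, 1)
    scale = 2 ** (bits - 1)
    return np.round(x * scale) / scale

def bmu(w, q):
    # index of the best matching unit for every query vector
    d2 = ((q[:, None, :] - w[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

reference = bmu(weights, queries)
for bits in (16, 12, 8, 6):
    hits = bmu(quantize(weights, bits), quantize(queries, bits))
    mismatch = (hits != reference).mean()
    print(f"{bits:2d}-bit: {100 * mismatch:.1f}% of BMU decisions differ from full precision")
```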

Place, publisher, year, edition, pages
IEEE Computer Society, 2019
National Category
Embedded Systems
Identifiers
urn:nbn:se:kth:diva-263799 (URN)
10.1109/ISVLSI.2019.00106 (DOI)
000538332100097
2-s2.0-85072991757 (Scopus ID)
Conference
18th IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2019), Miami, United States, 15-17 July 2019
Note

QC 20191115

Part of ISBN 978-1-7281-3391-1

Available from: 2019-11-14. Created: 2019-11-14. Last updated: 2024-10-15. Bibliographically approved.
6. Optimizing BCPNN Learning Rule for Memory Access
2020 (English) In: Frontiers in Neuroscience, ISSN 1662-4548, E-ISSN 1662-453X, Vol. 14, article id 878. Article in journal (Refereed), Published
Abstract [en]

Simulation of large-scale, biologically plausible spiking neural networks, e.g., the Bayesian Confidence Propagation Neural Network (BCPNN), usually requires high-performance supercomputers with dedicated accelerators, such as GPUs, FPGAs, or even Application-Specific Integrated Circuits (ASICs). Almost all of these computers are based on the von Neumann architecture, which separates storage and computation. In all these solutions, memory access is the dominant cost, even for highly customized computation and memory architectures such as ASICs. In this paper, we propose an optimization technique that makes the BCPNN simulation memory-access friendly by avoiding a dual-access pattern. The BCPNN synaptic traces and weights are organized as matrices accessed both row-wise and column-wise, and accessing data stored in DRAM with such a dual-access pattern is extremely expensive. A post-synaptic history buffer and an approximation function are therefore introduced to eliminate the troublesome column update. An error analysis combining theory and experiments suggests that the probability of introducing intolerable errors through this optimization can be bounded to a very small value, making it almost negligible. Deriving and validating this bound is the core contribution of this paper. Experiments on a GPU platform show that, compared to the previously reported baseline simulation strategy, the proposed optimization technique reduces the storage requirement by 33%, the global memory access demand by more than 27%, and the DRAM access rate by more than 5%; the latency of updating synaptic traces decreases by roughly 50%. Compared with a similar optimization technique reported in the literature, our method shows considerably better results. Although BCPNN is used as the target neural network model, the proposed optimization method can be applied to other artificial neural network models based on a Hebbian learning rule.
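The sketch below is a toy access-count model, not the paper's exact algorithm, of why a post-synaptic history buffer helps: it compares the matrix elements touched by the dual (row plus column) access pattern with those touched when column updates are deferred into a small buffer and folded into later row accesses. Network sizes and spike rates are invented.

```python
# Toy traffic model of dual-access vs. row-only access with a post-synaptic
# history buffer. Element counts stand in for DRAM traffic; the actual
# synaptic-trace arithmetic of BCPNN is not modeled here.
import numpy as np

rng = np.random.default_rng(2)
N_PRE, N_POST, STEPS = 512, 512, 1000
P_PRE, P_POST = 0.02, 0.02   # per-step spike probabilities (invented)

baseline = 0                 # elements touched with dual (row + column) access
optimized = 0                # elements touched with row-only access
history = np.zeros(N_POST)   # deferred post-synaptic events per column

for _ in range(STEPS):
    pre = rng.random(N_PRE) < P_PRE
    post = rng.random(N_POST) < P_POST

    # baseline: every pre spike updates a row, every post spike updates a column
    baseline += pre.sum() * N_POST + post.sum() * N_PRE

    # optimized: post spikes only update the small on-chip history buffer; the
    # deferred column contribution would be reconstructed during later row
    # accesses, so only row traffic reaches the synaptic matrix
    history += post
    optimized += pre.sum() * N_POST

print(f"dual-access elements touched: {baseline}")
print(f"row-only elements touched:    {optimized}")
print(f"traffic reduction:            {100 * (1 - optimized / baseline):.1f}%")
```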

Place, publisher, year, edition, pages
Frontiers Media SA, 2020
Keywords
Bayesian Confidence Propagation Neural Network (BCPNN), neuromorphic computing, Hebbian learning, spiking neural networks, memory optimization, DRAM, cache, digital neuromorphic hardware
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-283870 (URN)
10.3389/fnins.2020.00878 (DOI)
000570682100001
32982673 (PubMedID)
2-s2.0-85090894543 (Scopus ID)
Note

QC 20210527

Available from: 2020-11-26. Created: 2020-11-26. Last updated: 2024-03-15. Bibliographically approved.
7. Approximate computation of post-synaptic spikes reduces bandwidth to synaptic storage in a model of cortex
2021 (English) In: Proceedings of the 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE 2021), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 685-688. Conference paper, Published paper (Refereed)
Abstract [en]

The Bayesian Confidence Propagation Neural Network (BCPNN) is a spiking model of the cortex. The synaptic weights of BCPNN are organized as matrices that require substantial synaptic storage and a large bandwidth to it. The algorithm requires a dual access pattern to these matrices, both row-wise and column-wise, to access its synaptic weights. In this work, we exploit an algorithmic optimization that eliminates the column-wise accesses. The new computation model approximates the post-synaptic spike computation using a predictor. We have adopted this approximate computation model to improve upon the previously reported ASIC implementation, called eBrainII. We also present an error analysis showing that the approximation error is negligible. The reduction in storage and bandwidth to the synaptic storage results in a 48% reduction in energy compared to eBrainII. The reported approximation method also applies to other neural network models based on a Hebbian learning rule.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
3D DRAM, Approximate computing, ASIC, Bandwidth optimization, Neuromorphic Hardware, Backpropagation, Bandwidth, Computation theory, Matrix algebra, Access patterns, Algorithmic optimization, Approximate computation, Approximation methods, Computation model, Computational model, Hebbian learning, Neural network model, Neural networks
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-310721 (URN)
10.23919/DATE51398.2021.9474192 (DOI)
000805289900127
2-s2.0-85111009925 (Scopus ID)
Conference
2021 Design, Automation and Test in Europe Conference and Exhibition, DATE 2021, 1-5 February 2021, Grenoble, France
Note

Part of proceedings ISBN: 978-3-9819263-5-4

QC 20220413

Available from: 2022-04-13. Created: 2022-04-13. Last updated: 2022-09-13. Bibliographically approved.
