The Upset-Fault-Observer: A Concept for Self-healing Adaptive Fault Tolerance
KTH, School of Information and Communication Technology (ICT), Electronic Systems.
KTH, School of Information and Communication Technology (ICT), Electronic Systems. ORCID iD: 0000-0002-8072-1742
KTH, School of Information and Communication Technology (ICT), Electronic Systems. ORCID iD: 0000-0003-4859-3100
2014 (English) In: Proceedings of the 2014 NASA/ESA Conference on Adaptive Hardware and Systems, AHS 2014, IEEE Computer Society, 2014, 89-96 p. Conference paper (Refereed)
Abstract [en]

As integration advances toward atomic scales, components become highly defective and unstable over their lifetime. This demands paradigm shifts in electronic systems design. FPGAs are particularly sensitive to cosmic and other kinds of radiation that produce single-event upsets (SEUs) in configuration and internal memories. Typical fault-tolerance (FT) techniques combine triple-modular-redundancy (TMR) schemes with run-time reconfiguration (RTR). However, even the most successful approaches disregard the low suitability of fine-grain redundancy in nano-scale design, the poor scalability and programmability of application-specific architectures, the small performance-consumption ratio of board-level designs, or the scarce optimization capability of rigid redundancy structures. In that context, we introduce an innovative solution that exploits the flexibility, reusability, and scalability of a modular RTR SoC approach and reuses existing RTR IP-cores to assemble different TMR schemes during run-time. Thus, the system can adaptively trigger the adequate self-healing strategy according to execution environment metrics and user-defined goals. Specifically, the paper presents: (a) the upset-fault-observer (UFO), an innovative run-time self-test and recovery strategy that delivers FT on request over several function cores but saves the redundancy scalability cost by running periodic reconfigurable TMR scan-cycles, (b) run-time reconfigurable TMR schemes and self-repair mechanisms, and (c) an adaptive software organization model to manage the proposed FT strategies.
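The idea of a periodic TMR scan-cycle can be illustrated with a small software simulation. This is only a sketch under stated assumptions, not the implementation from the paper: cores are modeled as plain Python functions, the names `tmr_vote`, `scan_cycle`, and `reference` are hypothetical, and "reconfiguration" is modeled as replacing a faulty function with a known-good copy. The abstract does not specify how spare replicas are selected or how the voter is built, so those details are assumed here.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Core = Callable[[int], int]  # a function core, modeled as a pure function


def tmr_vote(outputs: List[int]) -> Tuple[int, Optional[int]]:
    """Majority-vote over three replica outputs.

    Returns the majority value and the index of the dissenting replica
    (None when all three agree). Raises if no two replicas agree.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority among replicas")
    faulty = next((i for i, out in enumerate(outputs) if out != value), None)
    return value, faulty


def scan_cycle(cores: List[Core], reference: List[Core], test_input: int) -> List[int]:
    """Visit each core once, forming a temporary TMR group around it.

    Assumption: two spare replicas are loaded with the known-good function
    (reference[i]) for the core under test. If the core under test is the
    dissenter, it is "reconfigured" by restoring the known-good function.
    Returns the indices of the cores that were repaired.
    """
    repaired = []
    for i, core in enumerate(cores):
        outputs = [core(test_input), reference[i](test_input), reference[i](test_input)]
        _, faulty = tmr_vote(outputs)
        if faulty == 0:  # replica 0 is the core under test
            cores[i] = reference[i]
            repaired.append(i)
    return repaired


# Example: the second core has been corrupted by a simulated upset.
reference = [lambda x: x + 1, lambda x: x * 2]
cores = [reference[0], lambda x: x * 2 + 1]
repaired_indices = scan_cycle(cores, reference, 3)
```

In this model only one core at a time is protected by redundancy, so two spare replicas are shared across all cores instead of tripling every core permanently, which is one way to read the "redundancy scalability cost" mentioned above.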

Place, publisher, year, edition, pages
IEEE Computer Society, 2014. 89-96 p.
NASA/ESA Conference on Adaptive Hardware and Systems, ISSN 1939-7003
Keyword [en]
partial and run-time-reconfiguration, fault-tolerance, self-healing, self-configuration, system-on-chip, hardware systems, reconfigurable IP-cores, adaptive embedded systems, reconfigurable computing
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
URN: urn:nbn:se:kth:diva-158304
DOI: 10.1109/AHS.2014.6880163
ISI: 000345896600013
ScopusID: 2-s2.0-84906705557
ISBN: 978-1-4799-5356-1
OAI: diva2:776285
2014 NASA/ESA Conference on Adaptive Hardware and Systems, AHS 2014, Leicester, United Kingdom, 14 July 2014 through 18 July 2014

QC 20150107

Available from: 2015-01-07 Created: 2015-01-07 Last updated: 2015-12-01 Bibliographically approved
In thesis
1. Cognitive and Self-Adaptive SoCs with Self-Healing Run-Time-Reconfigurable RecoBlocks
2015 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In contrast to classical Field-Programmable Gate Arrays (FPGAs), partial and run-time reconfigurable (RTR) FPGAs can selectively reconfigure partitions of their hardware almost immediately while they are still powered and operative. In this way, RTR FPGAs combine the flexibility of software with the high efficiency of hardware. However, their potential cannot be fully exploited due to the increased complexity of the design process and the intricacy of generating partial reconfigurations. FPGAs are often seen as a single auxiliary area to accelerate algorithms for specific problems. However, when several RTR partitions are implemented and combined with a processor system, new opportunities and challenges appear due to the creation of a heterogeneous RTR embedded system-on-chip (SoC).

The aim of this thesis is to investigate how the flexibility, reusability, and productivity in the design process of partial and RTR embedded SoCs can be improved to enable research and development of novel applications in areas such as hardware acceleration, dynamic fault-tolerance, self-healing, self-awareness, and self-adaptation. To address this question, this thesis proposes a solution based on modular reconfigurable IP-cores and design-and-reuse principles to reduce the design complexity and maximize the productivity of such FPGA-based SoCs. The research presented in this thesis found inspiration in several related topics and sciences such as reconfigurable computing, dependability and fault-tolerance, complex adaptive systems, bio-inspired hardware, organic and autonomic computing, psychology, and machine learning.

The outcome of this thesis demonstrates that the proposed solution addressed the research question and enabled investigation in initially unexpected fields. The particular contributions of this thesis are: (1) the RecoBlock SoC concept and platform with its flexible and reusable array of RTR IP-cores, (2) a simplified method to transform complex algorithms modeled in Matlab into relocatable partial reconfigurations adapted to an improved RecoBlock IP-core architecture, (3) the self-healing RTR fault-tolerant (FT) schemes, especially the Upset-Fault-Observer (UFO), which reuses available RTR IP-cores to self-assemble hardware redundancy during runtime, (4) the concept of Cognitive Reconfigurable Hardware (CRH) that defines a development path to achieve self-adaptation and cognitive development, (5) an adaptive self-aware and fault-tolerant RTR SoC that learns to adapt the RTR FT schemes to performance goals under uncertainty using rule-based decision making, and (6) a method based on online and model-free reinforcement learning that uses a Q-algorithm to self-optimize the activation of dynamic FT schemes in performance-aware RecoBlock SoCs.
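Contribution (6) refers to a Q-algorithm for choosing when to activate FT schemes. As a rough illustration, the standard tabular Q-learning update can be sketched as follows. The state names, the action set `ACTIONS`, the reward values, and the parameters `alpha` and `gamma` are all hypothetical; the abstract does not describe the states, actions, or reward signal actually used in the thesis.

```python
from collections import defaultdict

# Hypothetical action set: which FT scheme to activate.
ACTIONS = ("no_ft", "ufo_scan", "full_tmr")


def q_update(q, state, action, reward, next_state, alpha=0.5, gamma=0.9):
    """Apply one standard tabular Q-learning update and return the new value.

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


def greedy_action(q, state):
    """Pick the action with the highest learned value for this state."""
    return max(ACTIONS, key=lambda a: q[(state, a)])


# Example with made-up states and rewards; unseen entries default to 0.0.
q = defaultdict(float)
first = q_update(q, "high_faults", "full_tmr", 1.0, "low_faults")
second = q_update(q, "high_faults", "full_tmr", 1.0, "low_faults")
choice = greedy_action(q, "high_faults")
```

The update is model-free in the sense that it needs only observed transitions and rewards, not a model of how faults occur, which matches the "online and model-free" description above.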

The vision of this thesis proposes a new class of self-adaptive and cognitive hardware systems consisting of arrays of modular RTR IP-cores. Such a system becomes self-aware of its internal performance and learns to self-optimize the decisions that trigger the adequate self-organization of these RTR cores, i.e., to create dynamic hardware redundancy and self-healing, particularly while working in uncertain environments.

Abstract [sv] (translated to English)

Partial and run-time reconfiguration (RTR) means that one part of an integrated circuit can be reconfigured while the remaining part continues to operate. Modern Field Programmable Gate Array (FPGA) devices are often partially and run-time reconfigurable and thereby combine the flexibility of software with the efficiency of hardware. Unfortunately, the increased design complexity prevents their full potential from being exploited. Today FPGAs are mostly seen as hardware accelerators, but entirely new possibilities arise from combining a multiprocessor system with several reconfigurable partitions that can be reconfigured independently of each other during system operation.

The goal of the thesis is to investigate how the development process for partial and run-time reconfigurable FPGAs can be improved to enable research and development of new applications in areas such as hardware acceleration and self-healing and self-adaptive systems. The thesis proposes that a solution based on modular reconfigurable hardware cores combined with principles of reusability can reduce the complexity of the development process and lead to higher productivity in the development of embedded run-time reconfigurable systems. The research in the thesis was inspired by several related areas, such as reconfigurability, dependability and fault tolerance, complex adaptive systems, bio-inspired hardware, organic and autonomic systems, psychology, and machine learning.

The results of the thesis show that the proposed solution has potential in various application areas. The thesis makes the following contributions: (1) the RecoBlock system-on-chip platform consisting of several reconfigurable hardware cores, (2) a simplified method for implementing Matlab models in reconfigurable partitions, (3) methods for self-healing RTR fault-tolerant systems, e.g. the Upset-Fault-Observer, that self-assemble hardware redundancy during operation, (4) the development of the concept of cognitive reconfigurable hardware, (5) the use of the concept and platform to implement circuits that can be used in an unknown environment thanks to their ability to make rule-based decisions, and (6) a reinforcement learning method that uses a Q-algorithm for dynamic fault tolerance in performance-aware RecoBlock SoCs.

The vision of the thesis is a new class of self-adaptive and cognitive hardware systems consisting of modular run-time reconfigurable hardware cores. These systems become self-aware of their internal performance and can, through learning, optimize their decisions for self-organization of the reconfigurable cores. This creates dynamic hardware redundancy and self-healing systems that are better equipped to operate in an unknown environment.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2015. xiv, 83 p.
TRITA-ICT-ECS AVH, ISSN 1653-6363 ; 15:22
National Category
Electrical Engineering, Electronic Engineering, Information Engineering Embedded Systems Computer Systems
urn:nbn:se:kth:diva-178000 (URN), 978-91-7595-768-5 (ISBN)
Public defence
2015-12-17, Sal C, Elektrum, KTH-ICT, Kista, 13:00 (English)

QC 20151201

Available from: 2015-12-01 Created: 2015-12-01 Last updated: 2015-12-02 Bibliographically approved

By author/editor
Navas, Byron; Öberg, Johnny; Sander, Ingo