Guided Fault Injection Strategy for Rapid Critical Bit Detection in Radiation-Prone SRAM-FPGA
KTH, School of Electrical Engineering and Computer Science (EECS), Electrical Engineering, Electronics and Embedded systems. ORCID iD: 0000-0002-1024-7897
KTH, School of Electrical Engineering and Computer Science (EECS), Electrical Engineering, Electronics and Embedded systems. ORCID iD: 0000-0002-8072-1742
2024 (English). In: 2024 Design, Automation & Test in Europe Conference & Exhibition, DATE, IEEE, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Fault injection testing is vital for assessing the reliability of SRAM-FPGAs used in radiation environments. Given the scale and complexity of modern FPGAs, exhaustive fault injection is tedious and computationally expensive. A common approach to optimising the injection campaign is to target a subset of the configuration memory containing the essential and critical bits on which the system's functionality depends. Identifying essential bits in an FPGA design is often feasible through manufacturer documentation. Detecting critical bits, however, requires complex reverse engineering to map the correspondence between the configuration bits and the FPGA modules. This task requires a substantial amount of detail about the logic layout and the bitstream, which is not easily available because both are proprietary. In some cases, manual floorplanning becomes necessary, which could impact the performance of the application. Given these limitations, we examine the potential of Monte Carlo Tree Search to guide the fault injection process so that critical bits are identified with minimal injections. The key benefit of this approach is its ability to exploit the spatial relations among the configuration bits without relying on reverse engineering or offline campaign planning. Evaluation results demonstrate that the proposed approach achieves 99% coverage using 18% fewer injections than traditional methods. Notably, 95% of the critical bits were detected with fewer than 50% of the injections, achieving at least 2X higher sensitivity to critical bits with a minimal overhead of 0.04%.
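The record does not describe the paper's actual algorithm, so the following is only a minimal sketch of how a tree search could steer injections towards regions where critical bits cluster. It assumes a linear bit address space that is split recursively into halves, a UCB1 selection rule, and one random injection per iteration in place of a full rollout. The names `Region`, `guided_campaign`, `inject` and `leaf_size` are hypothetical, and `inject` stands in for whatever fault injector reports that a flipped bit caused a failure.

```python
import math
import random


class Region:
    """Tree node covering configuration-bit addresses [lo, hi)."""

    def __init__(self, lo, hi, parent=None):
        self.lo, self.hi, self.parent = lo, hi, parent
        self.children = []
        self.visits = 0            # injections performed inside this region
        self.hits = 0              # injections that exposed a critical bit
        self.remaining = hi - lo   # bits in this region not yet injected

    def ucb1(self, c=1.4):
        """Observed hit rate plus an exploration bonus (UCB1)."""
        if self.visits == 0:
            return math.inf        # unvisited regions are tried first
        bonus = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return self.hits / self.visits + bonus


def guided_campaign(n_bits, inject, budget, leaf_size=16, seed=0):
    """Run up to `budget` injections; return (critical bits found, injections used)."""
    rng = random.Random(seed)
    root = Region(0, n_bits)
    tested, found = set(), []
    for _ in range(budget):
        if root.remaining == 0:
            break                  # every bit has been injected once
        # Selection and expansion: descend to a leaf-sized region by UCB1,
        # splitting regions in half on demand and skipping exhausted ones.
        node = root
        while node.hi - node.lo > leaf_size:
            if not node.children:
                mid = (node.lo + node.hi) // 2
                node.children = [Region(node.lo, mid, node),
                                 Region(mid, node.hi, node)]
            live = [ch for ch in node.children if ch.remaining > 0]
            node = max(live, key=Region.ucb1)
        # Simulation: flip one untested bit in the chosen region.
        bit = rng.choice([b for b in range(node.lo, node.hi) if b not in tested])
        tested.add(bit)
        hit = bool(inject(bit))
        if hit:
            found.append(bit)
        # Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.hits += hit
            node.remaining -= 1
            node = node.parent
    return found, len(tested)


if __name__ == "__main__":
    # Synthetic oracle: two clusters of critical bits (an assumption for the demo).
    critical = set(range(1024, 1088)) | set(range(3000, 3040))
    found, used = guided_campaign(4096, lambda bit: bit in critical, budget=2048)
    print(f"{len(found)} of {len(critical)} critical bits found in {used} injections")
```

Because a region's hit rate raises its UCB1 score, injections concentrate on address ranges that have already produced failures, which is one way to use spatial locality without a map of the FPGA layout. The exploration constant and leaf size are untuned placeholders.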

Place, publisher, year, edition, pages
IEEE, 2024.
Series
Design Automation and Test in Europe Conference and Exhibition, ISSN 1530-1591
Keywords [en]
Emulation-based Fault injection, Critical bits, Monte Carlo Tree Search, Single Event Upset
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-356452; ISI: 001253778900286; Scopus ID: 2-s2.0-85196559010; OAI: oai:DiVA.org:kth-356452; DiVA, id: diva2:1914420
Conference
27th Design, Automation and Test in Europe Conference and Exhibition (DATE), March 25-27, 2024, Valencia, Spain
Note

Part of ISBN 979-8-3503-4860-6; 978-3-9819263-8-5

QC 20241119

Available from: 2024-11-19 Created: 2024-11-19 Last updated: 2025-12-16 Bibliographically approved
In thesis
1. Advancing Dependability of SRAM-FPGA: Towards Improved Mitigation and Fault Injection Strategies for Single Event Upset
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

SRAM-based field-programmable gate arrays (SRAM-FPGAs) are a class of programmable integrated circuits that use static random-access memory (SRAM) cells to configure their logic and routing resources. These devices play a pivotal role in digital computing owing to their inherent parallelism, high logic capacity and reconfigurability. These attributes have led to their widespread adoption in space missions, aerospace, medical devices, data centers, nuclear reactors and high-energy particle accelerators. In hazardous radiation environments, SRAM-FPGAs are valued not only for their high performance and cost-effectiveness but also for their ability to support design updates with minimal manual intervention and no physical hardware modifications. However, these devices are vulnerable to single event upset (SEU), a radiation-induced error that inverts SRAM cell contents. Since the configuration memory, which stores the FPGA functionality and the routing information, is composed of SRAM cells, such changes can have catastrophic consequences in safety-critical applications. Continued CMOS scaling further exacerbates this problem through reduced feature size and an increased volume of configuration bits. As devices shrink, they become more susceptible to multiple errors that weaken traditional mitigation schemes. Moreover, the exponential growth in configurable elements increases the cost and complexity of validation techniques such as fault injection. Addressing these dependability challenges in SRAM-FPGAs forms the core objective of this thesis. To achieve this, we introduce techniques to (1) detect failures in the scrubber, a widely used mitigation scheme for configuration memory, (2) optimize fault injection to reduce experimental time, and (3) identify vulnerable areas of the FPGA fabric to streamline dependability efforts.

To detect scrubber failures, this work introduces two non-invasive, log-based frameworks: a Markov chain model for scrubber health monitoring, and AnoDe, a self-supervised failure detection system. They cater to varying levels of domain knowledge, with the former leveraging IP specifications and the latter requiring none, making them adaptable to diverse operational scenarios. For optimizing fault injection, a Bayesian sampling framework is proposed to reduce the number of injections by integrating prior knowledge with the observed data. This method maintains the statistical confidence and the black-box nature of classical statistical fault injection while addressing the inflated sample size caused by parameter uncertainty. Finally, the thesis presents learning-based strategies using Monte Carlo Tree Search and Long Short-Term Memory models to identify critical bits in the configuration memory without reverse engineering the FPGA layout. These approaches integrate seamlessly into existing fault injection setups and are particularly valuable in environments with limited access to radiation facilities. Collectively, the methods developed in this work advance the state of the art in SEU resilience and enable the broader adoption of commercial SRAM-FPGAs in safety-critical domains.
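The thesis's Bayesian framework is not specified in this record. As a rough illustration of why prior knowledge can shrink a campaign, the sketch below compares a commonly used finite-population sample-size formula, evaluated at the conservative guess p = 0.5, with a Beta-Binomial update that stops once the posterior is narrow enough. The function names, the uniform Beta(1, 1) prior, the batch size, sampling with replacement, and the normal approximation to the posterior interval are all assumptions made for this example.

```python
import math
import random


def classical_sample_size(population, margin, z=1.96, p=0.5):
    """Finite-population sample size for estimating a proportion.

    p = 0.5 is the worst-case guess used when the failure rate is unknown,
    which is what inflates the number of injections.
    """
    denom = 1 + margin**2 * (population - 1) / (z**2 * p * (1 - p))
    return math.ceil(population / denom)


def bayesian_campaign(inject, population, margin, prior=(1.0, 1.0),
                      z=1.96, batch=50, seed=0):
    """Inject in batches until the Beta posterior half-width falls below margin.

    Returns (injections used, posterior mean failure rate). The half-width is
    a normal approximation: z times the posterior standard deviation.
    """
    rng = random.Random(seed)
    a, b = prior          # Beta(a, b): pseudo-counts of failures / non-failures
    n = 0
    while n < population:
        for _ in range(min(batch, population - n)):
            if inject(rng.randrange(population)):   # sampling with replacement
                a += 1
            else:
                b += 1
            n += 1
        variance = a * b / ((a + b) ** 2 * (a + b + 1))
        if z * math.sqrt(variance) <= margin:
            break
    return n, a / (a + b)


if __name__ == "__main__":
    # Synthetic oracle: exactly 5% of the bits cause a failure (demo assumption).
    population = 1_000_000
    used, rate = bayesian_campaign(lambda bit: bit % 20 == 0, population, 0.01)
    print("classical:", classical_sample_size(population, 0.01))
    print("bayesian :", used, "injections, estimated rate", round(rate, 4))
```

When the true failure rate is far from 0.5, the posterior variance drops much sooner than the worst-case formula assumes, so the stopping rule ends the campaign earlier. An informative prior, for example from an earlier campaign on a similar design, would be passed through `prior`.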

Abstract [sv]

SRAM-based field-programmable gate arrays (SRAM-FPGAs) are a class of programmable integrated circuits that use static random-access memory (SRAM) cells to configure their logic and routing resources. These devices play a central role in digital computing owing to their inherent parallelism, high logic capacity and reconfigurability. These properties have led to their widespread use in space missions, aerospace, medical devices, data centers, nuclear reactors and high-energy particle accelerators. In hazardous radiation environments, such as space missions and particle accelerators, SRAM-FPGAs are valued not only for their high performance and cost-effectiveness but also for their ability to support design updates with minimal manual intervention and no physical hardware modifications. However, these devices are vulnerable to single event upset (SEU), a radiation-induced error that can invert the contents of SRAM cells. Since the configuration memory, which stores the FPGA functionality and the routing information, consists of SRAM cells, such changes can have catastrophic consequences in safety-critical applications. Continued CMOS scaling further exacerbates this problem. The miniaturization of these devices increases their susceptibility to multiple errors that weaken the existing mitigation schemes. The exponential growth in configurable elements increases the cost and complexity of validation techniques such as fault injection. Addressing these dependability challenges in SRAM-FPGAs is the central objective of this thesis. To achieve this, we introduce techniques to (1) detect failures in the scrubber, a widely used mitigation scheme for configuration memory, (2) optimize fault injection to reduce experimental time, and (3) identify vulnerable areas of the FPGA fabric to streamline dependability efforts.

To detect scrubber failures, this work introduces two non-invasive, log-based frameworks: a Markov chain model for monitoring scrubber health, and AnoDe, a self-supervised failure detection system. They cater to different levels of domain knowledge, with the former using IP specifications and the latter requiring none, which makes them adaptable to different operational scenarios. To optimize fault injection, a Bayesian sampling framework is proposed that reduces the number of injections by integrating prior knowledge with observed data. This method maintains the statistical confidence and the black-box nature of classical statistical fault injection while addressing the inflated sample size caused by parameter uncertainty. Finally, the thesis presents learning-based strategies using Monte Carlo Tree Search and Long Short-Term Memory models to identify critical bits in the configuration memory without reverse engineering the FPGA layout. These methods integrate seamlessly into existing fault injection setups and are particularly valuable in environments with limited access to radiation facilities. Collectively, the methods developed in this work advance the state of the art in SEU resilience and enable broader adoption of commercial SRAM-FPGAs in safety-critical domains.

Place, publisher, year, edition, pages
Stockholm: Kungliga Tekniska högskolan, 2025. p. xvi, 81
Series
TRITA-EECS-AVL ; 2025:91
Keywords
Single Event Upset, Fault Injection, Markov Chains, Bayesian Sampling, Long Short Term Memory, Monte Carlo Tree Search
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-371773 (URN), 978-91-8106-420-9 (ISBN)
Public defence
2025-11-14, https://kth-se.zoom.us/j/69161493932, Kollegiesalen, Brinellvägen 8, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20251020

Available from: 2025-10-20 Created: 2025-10-20 Last updated: 2025-10-27 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Rajkumar, Trishna; Öberg, Johnny
