’1’-bit Count-based Sorting Unit to Reduce Link Power in DNN Accelerators
KTH, School of Electrical Engineering and Computer Science (EECS), Electronics and Embedded Systems.
KTH, School of Electrical Engineering and Computer Science (EECS), Electronics and Embedded Systems. ORCID iD: 0000-0001-8488-3506
KTH, School of Electrical Engineering and Computer Science (EECS), Electronics and Embedded Systems.
KTH, School of Electrical Engineering and Computer Science (EECS), Electronics and Embedded Systems. ORCID iD: 0000-0002-7693-6994
2026 (English). In: Proceedings 2026 International VLSI Symposium on Technology, Systems and Applications (2026 VLSI TSA), 2026. Conference paper, Published paper (Refereed)
Abstract [en]

Interconnect power consumption remains a bottleneck in Deep Neural Network (DNN) accelerators. While ordering data by ’1’-bit count can mitigate this by reducing switching activity, practical hardware sorting implementations remain underexplored. This work proposes a hardware implementation of a comparison-free sorting unit optimized for Convolutional Neural Networks (CNNs). Furthermore, by applying approximate computing, our design reduces hardware area while preserving the link power benefits of data ordering. Our approximate sorting unit achieves up to 35.4% area reduction while maintaining a 19.50% bit transition (BT) reduction, compared to the 20.42% achieved by the accurate implementation.
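
As a rough software illustration of the idea in the abstract (not the authors' hardware design), the sketch below counts bit transitions between consecutive words on a link, then reorders the words by ’1’-bit count using buckets rather than pairwise comparisons. The function names, the 8-bit word width, and the sample data are assumptions chosen for illustration, and the sketch assumes the receiver tolerates reordering.

```python
def popcount(word: int) -> int:
    """Number of '1' bits in a non-negative integer."""
    return bin(word).count("1")


def bit_transitions(words: list[int]) -> int:
    """Total bit flips between consecutive words sent over the same link."""
    return sum(popcount(a ^ b) for a, b in zip(words, words[1:]))


def order_by_popcount(words: list[int], width: int = 8) -> list[int]:
    """Place each word in the bucket for its '1'-bit count, then concatenate.

    No pairwise comparisons are made; `width` is the assumed word width in bits.
    """
    buckets: list[list[int]] = [[] for _ in range(width + 1)]
    for w in words:
        buckets[popcount(w)].append(w)
    return [w for bucket in buckets for w in bucket]


# Hypothetical 8-bit payload, chosen so the effect is easy to see.
data = [0b11111111, 0b00000001, 0b11111110, 0b00000000, 0b01111111, 0b00000011]
ordered = order_by_popcount(data)

print("unordered BT:", bit_transitions(data))
print("ordered BT:  ", bit_transitions(ordered))
```

On this hand-picked input the ordered stream has far fewer transitions than the unordered one; the gain on real CNN traffic depends on the data distribution, which this toy example does not model.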

Place, publisher, year, edition, pages
2026.
National Category
Embedded Systems
Identifiers
URN: urn:nbn:se:kth:diva-381103; OAI: oai:DiVA.org:kth-381103; DiVA, id: diva2:2059273
Conference
2026 International VLSI Symposium on Technology, Systems and Applications (2026 VLSI TSA), Hsinchu, Taiwan, April 13-17, 2026
Available from: 2026-05-11 Created: 2026-05-11 Last updated: 2026-05-12
In thesis
1. Embedded Machine Learning: Reliability and Performance Enhancement
2026 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Deploying machine learning (ML) on resource-constrained embedded systems presents significant challenges in reliability as well as computational, memory, and power efficiency. This thesis presents our research contributions across two main themes: reliability and performance optimization for embedded ML systems.

Regarding reliability, we developed an online image sensor fault detection method for autonomous vehicles, utilizing historical variance comparison to identify defective pixels without interrupting camera functionality. We further investigated the impact of various image sensor faults on pruned neural networks for object detection, examining both spatial faults (blur, darkness, speckle noise) and temporal faults arising from sensor aging. Our work addresses reliability from a system perspective, covering both sensor-level fault detection and neural network-level fault tolerance for embedded AI applications.

Regarding performance optimization, we aim to reduce latency, power, and energy consumption in ML accelerators through task mapping, data encoding, and approximate computing techniques. Specifically, we explored approximate computing through an FPGA-based non-negative matrix factorization accelerator with hybrid logarithmic approximation, achieving a 69× energy reduction compared with an ARM CPU implementation. For DNN accelerators, we proposed a travel time-based task mapping strategy, achieving up to 12.1% latency reduction by dynamically balancing workloads across processing elements. We further developed a ‘1’-bit count-based data ordering method to reduce bit transitions on NoC links, achieving up to 40.85% BT reduction and consequently lowering link power consumption.

To enable efficient hardware implementation, we designed an approximate sorting unit for the data ordering method, achieving a 35.4% area reduction with only a 4.5% loss in BT reduction effectiveness (from 20.42% to 19.50%). For emerging LLM architectures, we proposed a soft-edge quantizer for State Space Model quantization that improves accuracy by preserving outlier information through multi-scale quantization.
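
The 4.5% figure quoted above appears to be a relative loss rather than a percentage-point difference. Assuming it is computed from the two BT reduction values given in the abstract, the arithmetic works out as:

```latex
\[
\frac{20.42\% - 19.50\%}{20.42\%} = \frac{0.92}{20.42} \approx 0.045 = 4.5\%
\]
```

In absolute terms the approximate unit gives up 0.92 percentage points of BT reduction.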

Abstract [sv]

Att implementera maskininlärning (ML) på resursbegränsade inbyggda system medför betydande utmaningar avseende tillförlitlighet samt beräknings-, minnes- och energieffektivitet. Denna avhandling presenterar våra forskningsbidrag inom två huvudteman: tillförlitlighet och prestandaoptimering för inbyggda ML-system.

Avseende tillförlitlighet utvecklade vi en metod för online-feldetektering av bildsensorer för autonoma fordon, som utnyttjar historisk variansanalys för att identifiera defekta pixlar utan att avbryta kamerans funktion. Vi undersökte vidare hur olika bildsensorfel påverkar prunade neurala nätverk för objektdetektering, inklusive både spatiala fel (oskärpa, mörker, fläckbrus) och temporala fel orsakade av sensoråldring. Vårt arbete behandlar tillförlitlighet ur ett systemperspektiv och omfattar både feldetektering på sensornivå och feltolerans på neuralnätverksnivå för inbyggda AI-tillämpningar.

Avseende prestandaoptimering syftar vi till att minska latens, effektförbrukning och energiförbrukning i ML-acceleratorer genom uppgiftsmappning, datakodning och approximativa beräkningstekniker. Specifikt utforskade vi approximativ beräkning genom en FPGA-baserad accelerator för icke-negativ matrisfaktorisering med hybrid logaritmisk approximation, vilken uppnådde en 69× energireduktion jämfört med en ARM CPU-implementering. För DNN-acceleratorer föreslog vi en restidsbaserad uppgiftsmappningsstrategi som uppnådde upp till 12,1% latensreduktion genom dynamisk lastbalansering över beräkningsenheter. Vi utvecklade vidare en metod för dataordning baserad på antalet ‘1’-bitar för att minska bitövergångar på NoC-länkar, vilken uppnådde upp till 40,85% reduktion av bitövergångar och därmed lägre länkeffektförbrukning. För att möjliggöra effektiv hårdvaruimplementering konstruerade vi en approximativ sorteringsenhet för dataordningsmetoden, vilken uppnådde 35,4% areareduktion med endast 4,5% förlust i effektivitet avseende bitövergångsreduktion (från 20,42% till 19,50%). För kvantisering av tillståndsrumsmodeller i framväxande LLM-arkitekturer föreslog vi en soft-edge-kvantiserare som förbättrar noggrannheten genom att bevara information om extremvärden via flerskalig kvantisering.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2026. p. 60
Series
TRITA-EECS-AVL ; 2026:46
Keywords
Embedded system, Machine learning, Network-on-Chip, AI Accelerator
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-381104 (URN); 978-91-8106-610-4 (ISBN)
Public defence
2026-06-03, Kollegiesalen, https://kth-se.zoom.us/j/63924695817; Brinellvägen 8, KTH Royal Institute of Technology, Stockholm, 14:00 (English)
Note

QC 20260512

Available from: 2026-05-12 Created: 2026-05-11 Last updated: 2026-05-19. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Han, Ruichi; Chen, Yizhi; Lei, Tong; Altayo Gonzalez, Jordi; Hemani, Ahmed
