Unleashing 8-Bit Floating Point Formats Out of the Deep-Learning Domain
University of Bologna, Bologna, Italy.
KTH, School of Electrical Engineering and Computer Science (EECS), Electrical Engineering, Electronics and Embedded systems. ORCID iD: 0000-0003-0354-7207
KTH, School of Electrical Engineering and Computer Science (EECS), Electrical Engineering, Electronics and Embedded systems. ORCID iD: 0000-0003-0565-9376
University of Bologna, Bologna, Italy.
2024 (English). In: 2024 31st IEEE International Conference on Electronics, Circuits and Systems, ICECS 2024. Institute of Electrical and Electronics Engineers (IEEE), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Reduced-precision floating-point (FP) arithmetic is a technology trend to minimize memory usage and execution time on power-constrained devices. This paper explores potential applications of the 8-bit FP format beyond the classical deep-learning use cases. We comprehensively analyze alternative FP8 formats, considering the allocation of mantissa and exponent bits. Additionally, we examine the impact on energy efficiency, accuracy, and execution time of several digital signal processing and classical machine learning kernels using the parallel ultra-low-power (PULP) platform based on the RISC-V instruction set architecture. Our findings show that an appropriate exponent choice, combined with scaling methods, yields acceptable errors compared to FP32. Our study facilitates the adoption of FP8 formats outside the deep-learning domain to achieve consistent energy-efficiency and speed improvements without compromising accuracy. On average, our results indicate speedups of 3.14x, 6.19x, 11.11x, and 18.81x on 1, 2, 4, and 8 cores, respectively. Furthermore, the vectorized FP8 implementation in the same setup delivers energy savings of 2.97x, 5.07x, 7.37x, and 15.05x.
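To illustrate the trade-off the abstract refers to (splitting the 7 non-sign bits of an FP8 value between exponent and mantissa), the following is a minimal C sketch of a generic FP8 encoder/decoder, written for this record and not taken from the paper. The fp8_fmt struct, round-to-nearest behaviour, flush-to-zero underflow, and saturating overflow are simplifying assumptions; the E4M3 and E5M2 splits in main() are the two bit allocations most commonly discussed for FP8.

#include <stdint.h>
#include <stdio.h>
#include <math.h>

/* Hypothetical sketch (not the paper's implementation): a generic FP8 value
 * with 1 sign bit, E exponent bits, and M mantissa bits (E + M = 7).
 * Rounds to nearest, flushes underflow to zero, saturates overflow;
 * no subnormal, Inf, or NaN handling. */
typedef struct { int exp_bits; int man_bits; } fp8_fmt;

static uint8_t fp8_encode(float x, fp8_fmt f) {
    int bias = (1 << (f.exp_bits - 1)) - 1;
    int emax = (1 << f.exp_bits) - 1;
    uint8_t sign = (x < 0.0f) ? 1 : 0;
    float a = fabsf(x);
    if (a == 0.0f) return (uint8_t)(sign << 7);

    int e;
    float m = frexpf(a, &e);          /* a = m * 2^e with m in [0.5, 1) */
    m *= 2.0f; e -= 1;                /* renormalize so m is in [1, 2)  */

    int mant = (int)lroundf((m - 1.0f) * (float)(1 << f.man_bits));
    int ebiased = e + bias;
    if (mant == (1 << f.man_bits)) {  /* mantissa rounded up to 2.0 */
        mant = 0;
        ebiased++;
    }
    if (ebiased < 1)                  /* too small: flush to zero */
        return (uint8_t)(sign << 7);
    if (ebiased > emax) {             /* too large: saturate to largest code */
        ebiased = emax;
        mant = (1 << f.man_bits) - 1;
    }
    return (uint8_t)((sign << 7) | (ebiased << f.man_bits) | (uint8_t)mant);
}

static float fp8_decode(uint8_t v, fp8_fmt f) {
    int bias = (1 << (f.exp_bits - 1)) - 1;
    int sign = (v >> 7) & 1;
    int ebiased = (v >> f.man_bits) & ((1 << f.exp_bits) - 1);
    int mant = v & ((1 << f.man_bits) - 1);
    if (ebiased == 0 && mant == 0) return sign ? -0.0f : 0.0f;
    float m = 1.0f + (float)mant / (float)(1 << f.man_bits);
    float r = ldexpf(m, ebiased - bias);
    return sign ? -r : r;
}

int main(void) {
    fp8_fmt e4m3 = {4, 3};   /* wider mantissa: finer precision, less range */
    fp8_fmt e5m2 = {5, 2};   /* wider exponent: more range, coarser steps   */
    float x = 3.14159f;
    printf("E4M3: %f -> %f\n", x, fp8_decode(fp8_encode(x, e4m3), e4m3));
    printf("E5M2: %f -> %f\n", x, fp8_decode(fp8_encode(x, e5m2), e5m2));
    return 0;
}

Under these assumptions, the E4M3 split keeps more mantissa precision (3.14159 rounds to 3.25) while E5M2 trades precision for dynamic range (3.14159 rounds to 3.0); this is the kind of exponent-versus-mantissa choice the paper evaluates across DSP and classical machine learning kernels.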

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024.
Keywords [en]
approximate computing, float8, Parallel ultra-low-power platform, RISC-V, smallFloat data types
National Category
Computer Systems; Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-360165
DOI: 10.1109/ICECS61496.2024.10848785
ISI: 001445799800055
Scopus ID: 2-s2.0-85217619865
OAI: oai:DiVA.org:kth-360165
DiVA, id: diva2:1938782
Conference
31st IEEE International Conference on Electronics, Circuits and Systems (ICECS 2024), Nancy, France, Nov 18-20, 2024
Note

Part of ISBN 979-8-3503-7720-0

QC 20250224

Available from: 2025-02-19. Created: 2025-02-19. Last updated: 2025-05-05. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Yousefzadeh, Saba; Hemani, Ahmed
