NVIDIA Tensor Core programmability, performance & precision
KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST). ORCID iD: 0000-0003-0639-0639
KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST).
KTH, School of Electrical Engineering and Computer Science (EECS), Computational Science and Technology (CST). ORCID iD: 0000-0002-9901-9857
2018 (English) In: Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018, Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 522-531, article id 8425458. Conference paper, Published paper (Refereed)
Abstract [en]

The NVIDIA Volta GPU microarchitecture introduces a specialized unit, called the Tensor Core, that performs one matrix-multiply-and-accumulate on 4x4 matrices per clock cycle. The NVIDIA Tesla V100 accelerator, featuring the Volta microarchitecture, provides 640 Tensor Cores with a theoretical peak performance of 125 Tflop/s in mixed precision. In this paper, we investigate current approaches to program NVIDIA Tensor Cores, their performance, and the precision loss due to computation in mixed precision. Currently, NVIDIA provides three different ways of programming matrix-multiply-and-accumulate on Tensor Cores: the CUDA Warp Matrix Multiply Accumulate (WMMA) API; CUTLASS, a templated library based on WMMA; and cuBLAS GEMM. After experimenting with the different approaches, we found that NVIDIA Tensor Cores can deliver up to 83 Tflop/s in mixed precision on a Tesla V100 GPU, seven and three times the performance in single and half precision, respectively. A WMMA implementation of batched GEMM reaches a performance of 4 Tflop/s. While precision loss due to matrix multiplication with half-precision input might be critical in many HPC applications, it can be considerably reduced at the cost of increased computation. Our results indicate that HPC applications using matrix multiplications can strongly benefit from the use of NVIDIA Tensor Cores.
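To illustrate the lowest-level of the three programming interfaces named in the abstract, the sketch below shows a minimal CUDA kernel using the WMMA API to perform one warp-level 16x16x16 matrix-multiply-and-accumulate in mixed precision (half-precision inputs, single-precision accumulation). It is an illustrative sketch and not the benchmark code from the paper: the kernel name, variable names, and tile layout are ours, and it assumes a GPU of compute capability 7.0 or later with 16x16 row-major input tiles already resident in global memory.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes C = A * B + C on a single 16x16x16 tile via Tensor Cores.
// Inputs a, b are half precision; the accumulator c is single precision.
__global__ void wmma_tile_mma(const half *a, const half *b, float *c) {
    // Fragments describing the per-warp tiles held in registers.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // start from C = 0
    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);

    // Warp-synchronous matrix-multiply-and-accumulate on the Tensor Cores.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

Such a kernel would be launched with a single warp (e.g. <<<1, 32>>>) and compiled with an architecture flag such as nvcc -arch=sm_70. At the other end of the spectrum, cuBLAS can route a standard GEMM through the Tensor Cores by enabling Tensor Core math on the handle (cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH)) and calling cublasGemmEx with CUDA_R_16F inputs and a CUDA_R_32F accumulator; the exact calls used for the measurements reported above are not shown in this record.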

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2018. p. 522-531, article id 8425458
Keywords [en]
GEMM, GPU Programming, Mixed Precision, NVIDIA Tensor Cores
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-234096
DOI: 10.1109/IPDPSW.2018.00091
Scopus ID: 2-s2.0-85052208514
ISBN: 9781538655559 (print)
OAI: oai:DiVA.org:kth-234096
DiVA, id: diva2:1245066
Conference
32nd IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018, Vancouver, Canada, 21 May 2018 through 25 May 2018
Note

QC 20180904

Available from: 2018-09-04 Created: 2018-09-04 Last updated: 2018-09-04. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text (DOI); Scopus

Search in DiVA

By author/editor
Markidis, Stefano; Chien, Steven Wei Der; Laure, Erwin
By organisation
Computational Science and Technology (CST)
Computer Systems
