Scalable communication for high-order stencil computations using CUDA-aware MPI
Affiliations:
  • Aalto University, Department of Computer Science, Konemiehentie 2, 02150 Espoo, Finland
  • Academia Sinica, Institute of Astronomy and Astrophysics, Roosevelt Rd 1 Sec 4, Taipei 10617, Taiwan. ORCID iD: 0000-0002-8782-4664
  • Aalto University, Department of Computer Science, Konemiehentie 2, 02150 Espoo, Finland; Max Planck Institute for Solar System Research, Justus-von-Liebig-Weg 3, 37077 Göttingen, Germany; KTH Royal Institute of Technology, NORDITA, Hannes Alfvéns väg 12, SE-10691 Stockholm, Sweden; Stockholm University, Hannes Alfvéns väg 12, SE-10691 Stockholm, Sweden. ORCID iD: 0000-0002-9614-2200
  • Aalto University, Department of Computer Science, Konemiehentie 2, 02150 Espoo, Finland
2022 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 111, article id 102904. Article in journal (Refereed). Published.
Abstract [en]

Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has grown faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has widened further with the introduction of graphics processing units, which can deliver several times the throughput of central processing units in data-parallel tasks. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving the intra-node locality of workloads. Our GPU implementation scales strongly from one to 64 devices at 50%-87% of the efficiency expected from a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits a 20-60x speedup and 9-12x better energy efficiency in compute-bound benchmarks on 16 nodes.
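
To make the communication scheme concrete, below is a minimal sketch of a CUDA-aware MPI halo exchange for a one-dimensional domain decomposition, the kind of boundary transfer a high-order stencil code performs between integration substeps. It is an illustration written for this record, not the paper's implementation: the halo width NGHOST, the grid size NX, the periodic neighbor layout, and the buffer layout are all assumptions, and the code presumes an MPI library built with CUDA support so that device pointers can be passed directly to MPI calls.

/* Sketch: CUDA-aware MPI halo exchange, 1-D domain decomposition.
 * NGHOST and NX are illustrative values, not taken from the paper. */
#include <mpi.h>
#include <cuda_runtime.h>

#define NGHOST 3   /* halo width; a 6th-order central difference needs 3 */
#define NX     256 /* interior grid points owned by each rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Periodic neighbors in a 1-D decomposition. */
    int up   = (rank + 1) % nprocs;
    int down = (rank - 1 + nprocs) % nprocs;

    /* Device buffer layout: [left halo | NX interior points | right halo]. */
    double *field;
    cudaMalloc((void **)&field, (NX + 2 * NGHOST) * sizeof(double));
    cudaMemset(field, 0, (NX + 2 * NGHOST) * sizeof(double));

    /* Device pointers are handed straight to MPI; a CUDA-aware library
     * can move the data via GPUDirect RDMA instead of staging it
     * through host memory. */
    MPI_Request reqs[4];
    MPI_Irecv(field,               NGHOST, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(field + NGHOST + NX, NGHOST, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Isend(field + NX,          NGHOST, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Isend(field + NGHOST,      NGHOST, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &reqs[3]);
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

    /* ...stencil kernels on the interior and Runge-Kutta substeps would
     * run here; a full solver would overlap inner-domain computation
     * with the next halo exchange... */

    cudaFree(field);
    MPI_Finalize();
    return 0;
}

A production solver would typically overlap this exchange with stencil computation on interior points that need no remote data; the abstract's emphasis on intra-node locality points to further optimizations beyond this sketch.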

Place, publisher, year, edition, pages
Elsevier BV, 2022. Vol. 111, article id 102904.
Keywords [en]
High-performance computing, Graphics processing units, Stencil computations, Computational physics, Magnetohydrodynamics
National Category
Computer Engineering
Identifiers
URN: urn:nbn:se:kth:diva-313523
DOI: 10.1016/J.PARCO.2022.102904
ISI: 000793751100002
Scopus ID: 2-s2.0-85127169118
OAI: oai:DiVA.org:kth-313523
DiVA id: diva2:1665542
Note

QC 20220607

Available from: 2022-06-07. Created: 2022-06-07. Last updated: 2024-03-18. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Käpylä, Maarit J.

Search in DiVA

By author/editor
Väisälä, Miikka S.; Käpylä, Maarit J.
In the same journal
Parallel Computing
Computer Engineering
