Scalable communication for high-order stencil computations using CUDA-aware MPI
2022 (English). In: Parallel Computing, ISSN 0167-8191, E-ISSN 1872-7336, Vol. 111, article id 102904. Article in journal (Refereed), Published
Abstract [en]
Modern compute nodes in high-performance computing provide a tremendous level of parallelism and processing power. However, because arithmetic performance has been observed to increase faster than memory and network bandwidths, optimizing data movement has become critical for achieving strong scaling in many communication-heavy applications. This performance gap has been further accentuated by the introduction of graphics processing units, which can deliver several times higher throughput in data-parallel tasks than central processing units. In this work, we explore the computational aspects of iterative stencil loops and implement a generic communication scheme using CUDA-aware MPI, which we use to accelerate magnetohydrodynamics simulations based on high-order finite differences and third-order Runge-Kutta integration. We put particular focus on improving the intra-node locality of workloads. Our GPU implementation scales strongly from one to 64 devices at 50%-87% of the expected efficiency based on a theoretical performance model. Compared with a multi-core CPU solver, our implementation exhibits a 20-60x speedup and 9-12x improved energy efficiency in compute-bound benchmarks on 16 nodes.
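To make the numerical core described in the abstract concrete, the following is a minimal, hypothetical sketch of a high-order finite-difference stencil combined with a third-order Runge-Kutta integrator, applied to 1-D linear advection on a periodic domain. This is an illustration only, not the authors' MHD solver: the 6th-order central stencil, the classic SSP-RK3 (Shu-Osher) scheme, and all names here are assumptions chosen for the example; the paper's actual stencil order and RK3 variant may differ.

```python
import numpy as np

def dudx(u, h):
    """6th-order central first derivative with periodic boundaries
    (stencil radius 3, the kind of wide stencil whose halos drive
    the communication volume in distributed stencil codes)."""
    return (-np.roll(u, 3) + 9 * np.roll(u, 2) - 45 * np.roll(u, 1)
            + 45 * np.roll(u, -1) - 9 * np.roll(u, -2) + np.roll(u, -3)) / (60 * h)

def rk3_step(u, dt, rhs):
    """One step of the classic third-order SSP Runge-Kutta scheme
    (Shu-Osher form); assumed here for illustration."""
    u1 = u + dt * rhs(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
    return u / 3.0 + (2.0 / 3.0) * (u2 + dt * rhs(u2))

# Model problem: linear advection u_t + c u_x = 0, exact solution sin(x - c t).
n, c = 64, 1.0
h = 2.0 * np.pi / n
x = h * np.arange(n)
u = np.sin(x)
dt = 0.2 * h          # well inside the RK3 stability limit for this problem
nsteps = 50
for _ in range(nsteps):
    u = rk3_step(u, dt, lambda v: -c * dudx(v, h))
```

In a distributed GPU implementation, each `np.roll` access across a subdomain boundary corresponds to a halo cell that must be exchanged between ranks; with CUDA-aware MPI, such exchanges can pass device pointers directly to MPI calls instead of staging buffers through host memory.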
Place, publisher, year, edition, pages
Elsevier BV, 2022. Vol. 111, article id 102904
Keywords [en]
High-performance computing, Graphics processing units, Stencil computations, Computational physics, Magnetohydrodynamics
National Category
Computer Engineering
Identifiers
URN: urn:nbn:se:kth:diva-313523
DOI: 10.1016/J.PARCO.2022.102904
ISI: 000793751100002
Scopus ID: 2-s2.0-85127169118
OAI: oai:DiVA.org:kth-313523
DiVA id: diva2:1665542
Note
QC 20220607
Available from: 2022-06-07. Created: 2022-06-07. Last updated: 2024-03-18. Bibliographically approved.