2025 (English). In: Journal of Computational Science, ISSN 1877-7503, E-ISSN 1877-7511, Vol. 88, article id 102590. Article in journal (Refereed). Published.
Abstract [en]
As fusion energy devices advance, plasma simulations play a critical role in fusion reactor design. Particle-in-Cell Monte Carlo simulations are essential for modelling plasma-material interactions and analysing power load distributions on tokamak divertors. Previous work introduced hybrid parallelization in BIT1 using MPI with OpenMP/OpenACC for shared-memory and multicore CPU processing. In this extended work, we integrate MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming with OpenMP Target Tasks using the "nowait" and "depend" clauses, and with OpenACC Parallel using the "async(n)" clause. Our results show significant performance improvements: 16 MPI ranks plus OpenMP threads reduced simulation runtime by 53% on a petascale EuroHPC supercomputer, while the OpenACC multicore implementation achieved a 58% reduction compared to the MPI-only version. Scaling to 64 MPI ranks, OpenACC outperformed OpenMP, achieving a 24% improvement in the particle mover function. On the HPE Cray EX supercomputer, OpenMP and OpenACC consistently reduced simulation times, with a 37% reduction at 100 nodes. Results from MareNostrum 5, a pre-exascale EuroHPC supercomputer, highlight OpenACC's effectiveness, with the "async(n)" configuration delivering notable performance gains. However, OpenMP asynchronous configurations outperformed OpenACC at larger node counts, particularly in extreme scaling runs. Scaling BIT1 asynchronously to 128 GPUs, the OpenMP asynchronous multi-GPU configuration outperformed OpenACC in runtime and demonstrated superior scalability, with runtime continuing to improve up to 400 GPUs. Speedup and parallel efficiency (PE) studies show the OpenMP asynchronous multi-GPU version achieving an 8.77x speedup (54.81% PE) and OpenACC achieving an 8.14x speedup (50.87% PE) on MareNostrum 5, both surpassing the CPU-only version. At higher node counts, PE declined across all implementations due to communication and synchronization costs, but the asynchronous multi-GPU versions maintained better PE, demonstrating the benefit of asynchronous execution in reducing scalability bottlenecks. While the CPU-only implementation is faster in some cases, OpenMP's asynchronous multi-GPU approach delivers better GPU performance through asynchronous data transfer and task dependencies, ensuring data consistency and avoiding race conditions. Using NVIDIA Nsight tools, we confirmed BIT1's overall efficiency for large-scale plasma simulations on current and future exascale supercomputing infrastructures. Asynchronous data transfers and dedicating a GPU to each MPI rank enhance performance, and the OpenMP asynchronous multi-GPU implementation, which uses OpenMP Target Tasks with the "nowait" and "depend" clauses, outperformed all other configurations. This makes OpenMP the preferred application programming interface when performance portability, high throughput, and efficient GPU utilization are critical, enabling BIT1 to fully exploit modern supercomputing architectures and advance fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.
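To make the offloading pattern concrete, the following is a minimal sketch, not taken from the BIT1 source: the array names, particle count, time step, and the simple position update are illustrative assumptions. It shows asynchronous multi-GPU offloading in C with OpenMP Target Tasks, where "nowait" turns each transfer and kernel into a deferred target task and "depend" clauses order the transfer-compute-transfer chain for each device's particle slice, in the style the abstract describes:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define NP 1000000            /* total number of particles (illustrative) */

int main(void) {
    int ndev = omp_get_num_devices();
    if (ndev < 1) { fprintf(stderr, "no GPUs found\n"); return 1; }
    double *x = malloc(NP * sizeof *x);      /* particle positions */
    double *v = malloc(NP * sizeof *v);      /* particle velocities */
    const double dt = 1.0e-3;                /* time step (illustrative) */
    for (long i = 0; i < NP; i++) { x[i] = 0.0; v[i] = 1.0; }

    long chunk = NP / ndev;
    for (int d = 0; d < ndev; d++) {
        long lo = d * chunk;
        long n  = (d == ndev - 1) ? NP - lo : chunk;

        /* Host-to-device copy as a deferred target task: "nowait" returns
           immediately; "depend(out:)" lets the kernel wait on this chunk. */
        #pragma omp target enter data device(d) map(to: x[lo:n], v[lo:n]) \
                nowait depend(out: x[lo])

        /* Particle-mover kernel for this device's slice, ordered after the
           transfer via the dependence on x[lo]. */
        #pragma omp target teams distribute parallel for device(d) \
                nowait depend(inout: x[lo])
        for (long i = lo; i < lo + n; i++)
            x[i] += v[i] * dt;

        /* Device-to-host copy of the updated positions, again deferred. */
        #pragma omp target exit data device(d) map(from: x[lo:n]) \
                nowait depend(in: x[lo])
    }
    #pragma omp taskwait                     /* wait for all deferred tasks */

    printf("x[0] = %g\n", x[0]);
    free(x); free(v);
    return 0;
}

A corresponding sketch of the OpenACC "async(n)" style, reusing the same illustrative variables, enqueues each device's transfers and kernel on its own asynchronous queue and drains the queues afterwards:

#include <openacc.h>

/* Reusing x, v, dt, NP, ndev, and chunk from the sketch above. */
for (int d = 0; d < ndev; d++) {
    long lo = d * chunk;
    long n  = (d == ndev - 1) ? NP - lo : chunk;
    acc_set_device_num(d, acc_device_default);
    /* "async(d)" enqueues transfers and the kernel on queue d of device d
       and returns immediately, overlapping work across all devices. */
    #pragma acc parallel loop copyin(v[lo:n]) copy(x[lo:n]) async(d)
    for (long i = lo; i < lo + n; i++)
        x[i] += v[i] * dt;
}
for (int d = 0; d < ndev; d++) {
    acc_set_device_num(d, acc_device_default);
    #pragma acc wait(d)                      /* drain device d's queue */
}

Both sketches compile with an offloading compiler such as NVIDIA's nvc (with -mp=gpu for OpenMP, -acc for OpenACC); in a hybrid MPI setup like the one the abstract describes, each MPI rank would instead select its dedicated GPU before issuing the asynchronous work.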
Place, publisher, year, edition, pages
Netherlands: Elsevier BV, 2025
Keywords
Hybrid Programming, OpenMP, Task-Based Parallelism, Dependency Management, OpenACC, Asynchronous Execution, Multi-GPU Offloading, Overlapping Kernels, Large-Scale PIC Simulations
National Category
Computer Systems; Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-362742 (URN)
10.1016/j.jocs.2025.102590 (DOI)
2-s2.0-105003577843 (Scopus ID)
Funder
Swedish Research Council, 2022-06725
KTH Royal Institute of Technology, 101093261
Note
QC 20250425
Available from: 2025-04-24. Created: 2025-04-24. Last updated: 2025-05-27. Bibliographically approved.