KTH Publications (DiVA)
Publications (10 of 13)
Jansson, N., Karp, M., Perez, A., Mukha, T., Ju, Y., Liu, J., . . . Markidis, S. (2023). Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations. In: SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Paper presented at SC: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 12–17, 2023, Denver, CO, USA (pp. 1-9). Association for Computing Machinery (ACM), Article ID 5.
Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations
2023 (English). In: SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Association for Computing Machinery (ACM), 2023, pp. 1-9, article id 5. Conference paper, Published paper (Refereed).
Abstract [en]

We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are a modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation, and in-situ data compression. We carry out initial runs of Rayleigh–Bénard convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolve the long-standing question regarding the ultimate regime in RBC.
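The modular multi-backend design mentioned above is the key portability mechanism. Neko itself is written in modern Fortran with device backends; the C++ sketch below only illustrates the general runtime-dispatch pattern such a design implies, and every name in it is hypothetical rather than Neko's actual API.

    // Illustrative sketch of a runtime-selected compute backend; all names
    // are hypothetical and this is not Neko code (Neko is written in Fortran).
    #include <iostream>
    #include <memory>
    #include <stdexcept>
    #include <string>
    #include <vector>

    struct Backend {
        virtual ~Backend() = default;
        // One representative kernel; a real backend layer would expose the
        // full operator set (gather-scatter, matrix-free Ax products, ...).
        virtual void axpy(double a, const std::vector<double>& x,
                          std::vector<double>& y) const = 0;
        virtual std::string name() const = 0;
    };

    struct CpuBackend : Backend {
        void axpy(double a, const std::vector<double>& x,
                  std::vector<double>& y) const override {
            for (std::size_t i = 0; i < x.size(); ++i) y[i] += a * x[i];
        }
        std::string name() const override { return "cpu"; }
    };

    // A GPU backend would dispatch to CUDA/HIP/OpenCL kernels instead.
    std::unique_ptr<Backend> make_backend(const std::string& kind) {
        if (kind == "cpu") return std::make_unique<CpuBackend>();
        throw std::runtime_error("unknown backend: " + kind);
    }

    int main() {
        auto be = make_backend("cpu");  // selected at runtime, e.g. from a config
        std::vector<double> x(4, 1.0), y(4, 2.0);
        be->axpy(0.5, x, y);            // y[i] == 2.5
        std::cout << be->name() << ": y[0] = " << y[0] << "\n";
    }

Solver code written against such an interface stays unchanged when a new device type is added, which is what makes performance portability across GPU and CPU systems tractable.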

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
National Category
Computer Sciences; Fluid Mechanics and Acoustics
Identifiers
urn:nbn:se:kth:diva-340333 (URN); 10.1145/3581784.3627039 (DOI); 2-s2.0-85179549233 (Scopus ID)
Conference
SC: The International Conference for High Performance Computing, Networking, Storage, and Analysis, November 12–17, 2023, Denver, CO, USA
Funder
Swedish Research Council, 2019-04723; Swedish e-Science Research Center; EU, Horizon 2020, 101093393, 101092621, 956748
Note

Part of ISBN 9798400701092

QC 20231204

Available from: 2023-12-04. Created: 2023-12-04. Last updated: 2024-04-22. Bibliographically approved.
Alekseenko, A., Pall, S. & Lindahl, E. (2021). Experiences with Adding SYCL Support to GROMACS. In: IWOCL'21: Proceedings of the International Workshop on OpenCL (IWOCL 2021). Paper presented at the 2021 International Workshop on OpenCL, IWOCL 2021, Munich, Germany, April 2021 (virtual/online). Association for Computing Machinery (ACM)
Experiences with Adding SYCL Support to GROMACS
2021 (English). In: IWOCL'21: Proceedings of the International Workshop on OpenCL (IWOCL 2021), Association for Computing Machinery (ACM), 2021. Conference paper, Published paper (Refereed).
Abstract [en]

GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to the laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform to the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefit of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus does not receive the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility of targeting AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. Some design differences from OpenCL, such as data-flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS's task-scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level and hardware-specific code) against maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations we tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of the GPU acceleration code in GROMACS.
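The separate-source versus single-source contrast is easiest to see in a minimal SYCL program: the kernel is a C++ lambda in the same C++17 translation unit as the host code, and the runtime derives a task DAG from accessor data dependencies rather than from queue order. This is a generic SYCL 2020 example, not GROMACS code.

    // Generic SYCL 2020 vector addition (not GROMACS code): host and device
    // code share one C++ source file, unlike OpenCL's separate C99 kernels.
    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
        constexpr std::size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);
        sycl::queue q;  // selects a default device (CPU or GPU)
        {
            sycl::buffer<float> ab(a.data(), sycl::range<1>(n));
            sycl::buffer<float> bb(b.data(), sycl::range<1>(n));
            sycl::buffer<float> cb(c.data(), sycl::range<1>(n));
            q.submit([&](sycl::handler& h) {
                sycl::accessor A(ab, h, sycl::read_only);
                sycl::accessor B(bb, h, sycl::read_only);
                sycl::accessor C(cb, h, sycl::write_only);
                // The lambda is the device kernel; the runtime schedules it
                // once its accessor dependencies are satisfied (a task DAG,
                // not an in-order queue).
                h.parallel_for(sycl::range<1>(n),
                               [=](sycl::id<1> i) { C[i] = A[i] + B[i]; });
            });
        }  // buffer destructors synchronize and copy results back to c
        std::cout << "c[0] = " << c[0] << "\n";  // prints 3
    }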

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Series
ACM Proceedings
Keywords
GROMACS, Heterogeneous acceleration, SYCL
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-309871 (URN); 10.1145/3456669.3456690 (DOI); 2-s2.0-85105534079 (Scopus ID)
Conference
2021 International Workshop on OpenCL, IWOCL 2021, Munich, Germany, April 2021 (virtual/online)
Note

Part of conference proceedings: ISBN 978-145039033-0

QC 20220314

Available from: 2022-03-14. Created: 2022-03-14. Last updated: 2022-06-25. Bibliographically approved.
Pall, S., Zhmurov, A., Bauer, P., Abraham, M. J., Lundborg, M., Gray, A., . . . Lindahl, E. (2020). Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS. Journal of Chemical Physics, 153(13), Article ID 134110.
Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS
2020 (English). In: Journal of Chemical Physics, ISSN 0021-9606, E-ISSN 1089-7690, Vol. 153, no. 13, article id 134110. Article in journal (Refereed), Published.
Abstract [en]

The introduction of accelerator devices such as graphics processing units (GPUs) has had a profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these benefits, it has been necessary to reformulate some of the most fundamental algorithms, including the Verlet list, pair searching, and cutoffs. Here, we present the heterogeneous parallelization and acceleration design of molecular dynamics implemented in the GROMACS codebase over the last decade. The setup involves a general cluster-based approach to pair lists and non-bonded pair interactions that efficiently utilizes both GPU and central processing unit (CPU) single-instruction multiple-data (SIMD) acceleration, including the ability to load-balance tasks between CPUs and GPUs. The algorithm work efficiency is tuned for each type of hardware, and to use accelerators more efficiently, we introduce dual pair lists with rolling pruning updates (sketched below). Combined with new direct GPU–GPU communication and GPU integration, this enables excellent performance from single-GPU simulations through strong scaling across multiple GPUs and efficient multi-node parallelization.
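The dual pair-list scheme can be sketched conceptually as follows: an outer list is built infrequently with a padded cutoff, and a cheap pruning pass refreshes the inner list against the true interaction cutoff far more often. The toy below is an O(N²) illustration with hypothetical names, not the cluster-based GROMACS implementation.

    // Conceptual sketch of dual pair lists with pruning (illustrative only).
    #include <cstdio>
    #include <utility>
    #include <vector>

    struct Vec3 { double x, y, z; };

    static double dist2(const Vec3& a, const Vec3& b) {
        double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
        return dx * dx + dy * dy + dz * dz;
    }

    using PairList = std::vector<std::pair<int, int>>;

    // Built rarely with a padded cutoff rOuter > rCut (real codes use
    // cluster grids instead of this O(N^2) loop).
    PairList build_outer(const std::vector<Vec3>& r, double rOuter) {
        PairList outer;
        for (std::size_t i = 0; i < r.size(); ++i)
            for (std::size_t j = i + 1; j < r.size(); ++j)
                if (dist2(r[i], r[j]) < rOuter * rOuter)
                    outer.push_back({int(i), int(j)});
        return outer;
    }

    // "Rolling" pruning: a cheap pass over the outer list every few steps,
    // keeping only pairs currently inside the true cutoff rCut.
    PairList prune(const PairList& outer, const std::vector<Vec3>& r, double rCut) {
        PairList inner;
        for (auto [i, j] : outer)
            if (dist2(r[i], r[j]) < rCut * rCut) inner.push_back({i, j});
        return inner;
    }

    int main() {
        std::vector<Vec3> r = {{0, 0, 0}, {0.5, 0, 0}, {1.5, 0, 0}, {3.0, 0, 0}};
        auto outer = build_outer(r, 2.0);   // padded cutoff, rebuilt rarely
        auto inner = prune(outer, r, 1.0);  // true cutoff, refreshed often
        std::printf("outer pairs: %zu, inner pairs: %zu\n", outer.size(), inner.size());
    }

The padding buys time: atoms cannot diffuse across the pad between outer rebuilds, so the expensive search amortizes over many steps while the pruned inner list keeps the per-step interaction work small.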

Place, publisher, year, edition, pages
AIP Publishing, 2020
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-285625 (URN); 10.1063/5.0018516 (DOI); 000578502400002 (ISI); 33032406 (PubMedID); 2-s2.0-85092604813 (Scopus ID)
Note

QC 20201110

Available from: 2020-11-10. Created: 2020-11-10. Last updated: 2022-06-25. Bibliographically approved.
Pall, S. & Schulz, R. (2019). Advances in the OpenCL offload support in GROMACS. In: Proceedings of the International Workshop on OpenCL (IWOCL'19). Paper presented at the International Workshop on OpenCL (IWOCL'19). Association for Computing Machinery (ACM)
Advances in the OpenCL offload support in GROMACS
2019 (English). In: Proceedings of the International Workshop on OpenCL (IWOCL'19), Association for Computing Machinery (ACM), 2019. Conference paper, Published paper (Refereed).
Abstract [en]

GROMACS is a molecular dynamics (MD) simulation package widely used in research and education on machines ranging from laptops to workstations to the largest supercomputers. Built on a highly portable, free and open-source codebase, GROMACS is known to have among the fastest simulation engines, thanks to highly tuned kernels for more than a dozen processor architectures. For CPU architectures it relies on SIMD intrinsics-based code, while for GPUs, besides the dominant CUDA platform, OpenCL is also supported on NVIDIA, AMD and Intel GPUs and is actively developed. This talk presents the recent advances in the improved offload capabilities and broader platform support of the GROMACS OpenCL codebase.

GROMACS has a long history of CUDA support; to maintain portability to platforms other than the dominant accelerator platform, an OpenCL port was developed four years ago and has been used predominantly on AMD GPUs. Despite the modest user base, recent efforts have focused on achieving feature parity with the CUDA codebase. The offload of additional computation (the particle-mesh Ewald solver) aims to compensate for the shift in the performance advantage of GPUs and the resulting runtime imbalance, as well as to better support dense accelerator nodes. Performance improvements of up to 1.5x can be seen on workstations equipped with AMD Vega GPUs.

Additionally, platform support has been expanded to Intel iGPUs. Tweaks to the underlying pair-interaction algorithm setup were necessary to reach good performance. We observe a 5-25% performance benefit in an asynchronous offload scenario running concurrently on both the CPU cores and the iGPU, compared to only using the highly tuned SIMD intrinsics code on the CPU cores (a sketch of this offload pattern follows the abstract). By leaving a larger fraction of the limited power budget of a mobile processor to the iGPU, application performance improved, which suggests that a configurable TDP allocation to match the computational load with the hardware balance would be beneficial. Such results will become especially useful as most future high-performance processor architectures will increase integration and will feature on-chip heterogeneity, with different components more or less well suited to different parts of an HPC application.
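A minimal sketch of the asynchronous-offload pattern described above, with std::async standing in for an OpenCL command queue; the function names are hypothetical and this is not GROMACS code.

    // Overlapping "offloaded" work with CPU work, then synchronizing;
    // the overlap is what hides the offload latency. Illustrative only.
    #include <functional>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Stand-in for the non-bonded work enqueued on the accelerator.
    double nonbonded_on_accelerator(const std::vector<double>& q) {
        return std::accumulate(q.begin(), q.end(), 0.0);
    }

    // Stand-in for SIMD-accelerated work that stays on the CPU cores.
    double bonded_on_cpu(const std::vector<double>& q) {
        double s = 0.0;
        for (double v : q) s += v * v;
        return s;
    }

    int main() {
        std::vector<double> q(1u << 20, 0.001);
        // Launch the offloaded part asynchronously ...
        auto dev = std::async(std::launch::async, nonbonded_on_accelerator,
                              std::cref(q));
        // ... keep the CPU cores busy in the meantime ...
        double cpu = bonded_on_cpu(q);
        // ... and synchronize only when both results are needed.
        std::cout << "total = " << cpu + dev.get() << "\n";
    }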

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2019
Keywords
OpenCL, GPGPU, GPU offload, molecular dynamics, GROMACS
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-279924 (URN); 10.1145/3318170.3318176 (DOI); 000559062300007 (ISI); 2-s2.0-85069189371 (Scopus ID)
Conference
International Workshop on OpenCL (IWOCL'19)
Note

Part of ISBN 9781450362306

QC 20200909

Available from: 2020-09-09. Created: 2020-09-09. Last updated: 2024-03-11. Bibliographically approved.
Kutzner, C., Páll, S., Fechner, M., Esztermann, A., de Groot, B. L. & Grubmüller, H. (2019). More bang for your buck: Improved use of GPU nodes for GROMACS 2018. Journal of Computational Chemistry, 40(27), 2418-2431
More bang for your buck: Improved use of GPU nodes for GROMACS 2018
2019 (English). In: Journal of Computational Chemistry, ISSN 0192-8651, E-ISSN 1096-987X, Vol. 40, no. 27, pp. 2418-2431. Article in journal (Refereed), Published.
Abstract [en]

We identify hardware that is optimal for producing molecular dynamics (MD) trajectories on Linux compute clusters with the GROMACS 2018 simulation package. To that end, we benchmark the GROMACS performance on a diverse set of compute nodes and relate it to the cost of the nodes, which may include their lifetime costs for energy and cooling. In agreement with our earlier investigation using GROMACS 4.6 on hardware of 2014, the performance-to-price ratio of consumer GPU nodes is considerably higher than that of CPU nodes. However, with GROMACS 2018, the optimal CPU-to-GPU processing power balance has shifted even further toward the GPU. Hence, nodes optimized for GROMACS 2018 and later versions enable a significantly higher performance-to-price ratio than nodes optimized for older GROMACS versions. Moreover, the shift toward GPU processing makes it possible to cheaply upgrade old nodes with recent GPUs, yielding essentially the same performance as comparable brand-new hardware.

Place, publisher, year, edition, pages
Wiley, 2019
Keywords
molecular dynamics, GPU, parallel computing, energy efficiency, benchmark, GROMACS, computer simulations, CUDA, performance to price, high throughput MD
National Category
Communication Systems
Identifiers
urn:nbn:se:kth:diva-255767 (URN); 10.1002/jcc.26011 (DOI); 000477372600001 (ISI); 31260119 (PubMedID); 2-s2.0-85068468606 (Scopus ID)
Note

QC 20190812

Available from: 2019-08-12. Created: 2019-08-12. Last updated: 2024-03-15. Bibliographically approved.
Hess, B., Gong, J., Pall, S., Schlatter, P. & Peplinski, A. (2016). Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek5000. Paper presented at the Third International Workshop on Sustainable Ultrascale Computing Systems (pp. 69-72).
Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek5000
2016 (English). Conference paper, Published paper (Refereed).
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-227955 (URN)
Conference
The Third International Workshop on Sustainable Ultrascale Computing Systems
Note

QC 20180607

Available from: 2018-05-15. Created: 2018-05-15. Last updated: 2024-03-15. Bibliographically approved.
Kutzner, C., Pall, S., Fechner, M., Esztermann, A., de Groot, B. L. & Grubmüller, H. (2015). Best bang for your buck: GPU nodes for GROMACS biomolecular simulations. Journal of Computational Chemistry, 36(26), 1990-2008
Best bang for your buck: GPU nodes for GROMACS biomolecular simulations
2015 (English). In: Journal of Computational Chemistry, ISSN 0192-8651, E-ISSN 1096-987X, Vol. 36, no. 26, pp. 1990-2008. Article in journal (Refereed), Published.
Abstract [en]

The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware, from commodity workstations to high-performance computing clusters. Hardware features are well exploited with a combination of single-instruction multiple-data (SIMD), multithreading, and message passing interface (MPI)-based single-program multiple-data/multiple-program multiple-data parallelism, while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs, this improvement is equally reflected in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed, as these cards do not support error-checking-and-correcting (ECC) memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants of cost-efficiency, such as hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime of a few years until replacement, the costs for electrical power and cooling can become larger than the cost of the hardware itself (a simple lifetime-cost model below makes this concrete). Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.
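The lifetime-cost argument can be made concrete with a simple model; the symbols and the worked numbers below are illustrative assumptions, not figures from the paper:

    C_{\text{lifetime}} = C_{\text{hardware}} + P \, T \, c_{\text{energy}}

where P is the average node power draw (including cooling overhead), T the service lifetime, and c_energy the electricity price. For instance, with P = 0.5 kW, T = 5 years (about 43,800 hours), and c = 0.25 EUR/kWh:

    0.5\,\mathrm{kW} \times 43\,800\,\mathrm{h} \times 0.25\,\mathrm{EUR/kWh} \approx 5\,475\,\mathrm{EUR},

which can indeed exceed the purchase price of a commodity GPU node, so trajectory produced per unit of lifetime cost, not per unit of purchase price, is the relevant metric.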

Keywords
molecular dynamics, GPU, parallel computing, energy efficiency, benchmark, MD, hybrid parallelization
National Category
Chemical Sciences
Identifiers
urn:nbn:se:kth:diva-173956 (URN); 10.1002/jcc.24030 (DOI); 000360807700007 (ISI); 26238484 (PubMedID); 2-s2.0-84941180719 (Scopus ID)
Note

QC 20151006

Available from: 2015-10-06. Created: 2015-09-24. Last updated: 2022-06-23. Bibliographically approved.
Wennberg, C. L., Murtola, T., Pall, S., Abraham, M. J., Hess, B. & Lindahl, E. (2015). Direct-Space Corrections Enable Fast and Accurate Lorentz-Berthelot Combination Rule Lennard-Jones Lattice Summation. Journal of Chemical Theory and Computation, 11(12), 5737-5746
Direct-Space Corrections Enable Fast and Accurate Lorentz-Berthelot Combination Rule Lennard-Jones Lattice Summation
2015 (English). In: Journal of Chemical Theory and Computation, ISSN 1549-9618, E-ISSN 1549-9626, Vol. 11, no. 12, pp. 5737-5746. Article in journal (Refereed), Published.
Abstract [en]

Long-range lattice summation techniques such as the particle-mesh Ewald (PME) algorithm for electrostatics have been revolutionary for the precision and accuracy of molecular simulations in general. Despite the performance penalty associated with lattice-summation electrostatics, few biomolecular simulations today are performed without it. There are increasingly strong arguments for moving in the same direction for Lennard-Jones (LJ) interactions, and by using geometric approximations of the combination rules in reciprocal space, we have been able to make a very high-performance implementation available in GROMACS. Here, we present a new way to correct for these approximations to achieve exact treatment of Lorentz-Berthelot combination rules within the cutoff, so that only a very small approximation error remains outside the cutoff (a part that would be completely ignored without LJ-PME). This not only improves accuracy by almost an order of magnitude but also achieves absolute biomolecular simulation performance that is an order of magnitude faster than any other available lattice summation technique for LJ interactions. The implementation includes both CPU and GPU acceleration, and its combination with improved-scaling LJ-PME simulations now provides performance close to the truncated-potential methods in GROMACS, but with much higher accuracy.
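For reference, the two combination rules in question are the standard textbook definitions (recalled here for clarity, not quoted from the paper): Lorentz-Berthelot mixes \sigma arithmetically and \epsilon geometrically,

    \sigma_{ij}^{\mathrm{LB}} = \tfrac{1}{2}\,(\sigma_i + \sigma_j), \qquad
    \epsilon_{ij} = \sqrt{\epsilon_i\,\epsilon_j},

whereas the reciprocal-space part of LJ-PME assumes purely geometric mixing, \sigma_{ij}^{\mathrm{geo}} = \sqrt{\sigma_i\,\sigma_j}. The direct-space correction then adds back the pairwise difference inside the cutoff r_c,

    V_{ij}^{\mathrm{corr}}(r) = V_{ij}^{\mathrm{LB}}(r) - V_{ij}^{\mathrm{geo}}(r)
    \quad \text{for } r < r_c,

so the geometric approximation only affects the small tail beyond r_c.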

Place, publisher, year, edition, pages
American Chemical Society (ACS), 2015
National Category
Physical Sciences
Identifiers
urn:nbn:se:kth:diva-180232 (URN); 10.1021/acs.jctc.5b00726 (DOI); 000366223400017 (ISI); 26587968 (PubMedID); 2-s2.0-84949640540 (Scopus ID)
Note

QC 20160119

Available from: 2016-01-19. Created: 2016-01-08. Last updated: 2022-06-23. Bibliographically approved.
Abraham, M. J., Murtola, T., Schulz, R., Pall, S., Smith, J. C., Hess, B. & Lindahl, E. (2015). GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX, 1-2, 19-25
GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers
2015 (English). In: SoftwareX, E-ISSN 2352-7110, Vol. 1-2, pp. 19-25. Article in journal (Refereed), Published.
Abstract [en]

GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types together with preparation and analysis tools, and several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights through several new and enhanced parallelization algorithms. These work on every level: SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. The latest best-in-class compressed trajectory storage format is also supported.

Place, publisher, year, edition, pages
Elsevier, 2015
Keywords
Molecular dynamics, GPU, SIMD, Free energy
National Category
Physical Sciences
Research subject
Chemistry
Identifiers
urn:nbn:se:kth:diva-248468 (URN); 10.1016/j.softx.2015.06.001 (DOI); 2-s2.0-84946416234 (Scopus ID)
Note

QC 20190429

Available from: 2019-04-09. Created: 2019-04-09. Last updated: 2022-06-26. Bibliographically approved.
Páll, S., Abraham, M. J., Kutzner, C., Hess, B. & Lindahl, E. (2015). Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS. In: Solving software challenges for exascale. Paper presented at the 2nd International Conference on Exascale Applications and Software (EASC), April 2–3, 2014, Stockholm, Sweden (pp. 3-27). Springer Publishing Company
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
2015 (English). In: Solving software challenges for exascale, Springer Publishing Company, 2015, pp. 3-27. Conference paper, Published paper (Refereed).
Abstract [en]

GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism, within and between nodes respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation, in particular very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.

Place, publisher, year, edition, pages
Springer Publishing Company, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8759
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-170715 (URN); 10.1007/978-3-319-15976-8_1 (DOI); 000355749700001 (ISI); 2-s2.0-84928911118 (Scopus ID); 978-3-319-15975-1 (ISBN); 978-3-319-15976-8 (ISBN)
Conference
2nd International Conference on Exascale Applications and Software (EASC), April 2–3, 2014, Stockholm, Sweden
Funder
Science for Life Laboratory - a national resource center for high-throughput molecular bioscience
Note

QC 20150706

Available from: 2015-07-06. Created: 2015-07-03. Last updated: 2022-06-23. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-0603-5514