1 - 15 of 15
  • 1.
    Abraham, Mark James
    et al.
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Skolan för teknikvetenskap (SCI), Fysik, Teoretisk biologisk fysik.
    Murtola, Teemu
    Schulz, Roland
    Pall, Szilard
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Skolan för teknikvetenskap (SCI), Fysik, Teoretisk biologisk fysik.
    Smith, Jeremy C.
    Hess, Berk
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Skolan för teknikvetenskap (SCI), Fysik, Teoretisk biologisk fysik.
    Lindahl, Erik
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Skolan för teknikvetenskap (SCI), Fysik, Teoretisk biologisk fysik.
    GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. 2015. In: SoftwareX, E-ISSN 2352-7110, Vol. 1-2, pp. 19-25. Article in journal (Peer-reviewed)
    Abstract [en]

    GROMACS is one of the most widely used open-source and free software codes in chemistry, used primarily for dynamical simulations of biomolecules. It provides a rich set of calculation types, preparation and analysis tools. Several advanced techniques for free-energy calculations are supported. In version 5, it reaches new performance heights, through several new and enhanced parallelization algorithms. These work on every level; SIMD registers inside cores, multithreading, heterogeneous CPU–GPU acceleration, state-of-the-art 3D domain decomposition, and ensemble-level parallelization through built-in replica exchange and the separate Copernicus framework. The latest best-in-class compressed trajectory storage format is supported.

    Full text (pdf)
  • 2.
    Alekseenko, Andrey
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Pall, Szilard
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Comparing the Performance of SYCL Runtimes for Molecular Dynamics Applications. 2023. In: International Workshop on OpenCL (IWOCL '23), ACM Digital Library, 2023, article id 6. Conference paper (Peer-reviewed)
    Abstract [en]

    SYCL is a cross-platform, royalty-free standard for programming a wide range of hardware accelerators. It is a powerful and convenient way to write standard C++17 code that can take full advantage of available devices. There are already multiple SYCL implementations targeting a wide range of platforms, from embedded to HPC clusters. Since several implementations can target the same hardware, application developers and users must know how to choose the most fitting runtime for their needs. In this talk, we will compare the runtime performance of two major SYCL runtimes targeting GPUs, oneAPI DPC++ and Open SYCL [3], to the native implementations for the purposes of GROMACS, a high-performance molecular dynamics engine.

    Molecular dynamics (MD) applications were among the earliest adopters of GPU acceleration, with force calculations being an obvious target for offloading. MD is an iterative algorithm where, in its most basic form, on each step the forces acting between particles are computed and then the equations of motion are integrated. As the computational power of GPUs grew, the strong-scaling problem became apparent: the biophysical systems modeled with molecular dynamics typically have fixed sizes, and the goal is to perform more time steps, each taking less than a millisecond of wall time. This places high demands on the underlying GPU framework, requiring it to schedule many small tasks with minimal overhead, to overlap CPU and GPU work for large systems, and to keep the GPU occupied for smaller systems. Another requirement is that application developers can control the scheduling to optimize for external dependencies, such as MPI communication.

    GROMACS is a widely used MD engine, supporting a wide range of hardware and software platforms, from laptops to the largest supercomputers [1]. Portability and performance across multiple architectures have always been among the primary goals of the project, necessary to keep the code not only efficient but also maintainable. The initial support for NVIDIA accelerators, using CUDA, was added to GROMACS in 2010. Since then, heterogeneous parallelization has been a major target for performance optimization, not limited to NVIDIA devices but later adding support for GPUs of other vendors, as well as Xeon Phi accelerators. GROMACS initially adopted SYCL in its 2021 release to replace its previous GPU portability layer, OpenCL [2]. In further releases, the number of offloading modes supported by the SYCL backend steadily increased. As of GROMACS 2023, SYCL support in GROMACS has achieved near feature parity with CUDA while allowing a single code to target the GPUs of all three major vendors with minimal specialization.

    While this clearly supports the portability promise of modern SYCL implementations, the performance of such portable code remains an open question, especially given the strict requirements of MD algorithms. In this talk, we compare the performance of GROMACS across a wide range of system sizes when using the oneAPI DPC++ and Open SYCL runtimes on high-performance NVIDIA, AMD, and Intel GPUs. Besides the analysis of individual kernel performance, we focus on the runtime overhead and the efficiency of task scheduling compared to a highly optimized implementation using the native frameworks, and we discuss the possible sources of suboptimal performance and the amount of vendor-specific code branches, such as intrinsics or workarounds for compiler bugs, required to achieve optimal performance.
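    The basic MD iteration the abstract describes (compute pairwise forces, then integrate the equations of motion) can be illustrated with a toy sketch. This is illustrative C++ only, not GROMACS code; the 1/r^3 force law, function and type names are made up.

        #include <cmath>
        #include <cstddef>
        #include <vector>

        struct Vec3 { double x = 0, y = 0, z = 0; };

        // One toy MD step: pairwise forces, then a leapfrog-style update.
        // A real engine uses cut-offs, pair lists and SIMD/GPU kernels instead
        // of the O(N^2) double loop shown here.
        void mdStep(std::vector<Vec3>& r, std::vector<Vec3>& v,
                    double dt, double invMass) {
            std::vector<Vec3> f(r.size());
            // 1) forces acting between particle pairs
            for (std::size_t i = 0; i < r.size(); ++i) {
                for (std::size_t j = i + 1; j < r.size(); ++j) {
                    const double dx = r[i].x - r[j].x;
                    const double dy = r[i].y - r[j].y;
                    const double dz = r[i].z - r[j].z;
                    const double r2 = dx * dx + dy * dy + dz * dz + 1e-12;
                    const double s  = 1.0 / (r2 * std::sqrt(r2));  // made-up interaction
                    f[i].x += s * dx; f[i].y += s * dy; f[i].z += s * dz;
                    f[j].x -= s * dx; f[j].y -= s * dy; f[j].z -= s * dz;
                }
            }
            // 2) integrate the equations of motion
            for (std::size_t i = 0; i < r.size(); ++i) {
                v[i].x += dt * f[i].x * invMass; r[i].x += dt * v[i].x;
                v[i].y += dt * f[i].y * invMass; r[i].y += dt * v[i].y;
                v[i].z += dt * f[i].z * invMass; r[i].z += dt * v[i].z;
            }
        }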

    Full text (pdf)
  • 3.
    Alekseenko, Andrey
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Pall, Szilard
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Lindahl, Erik
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Experiences with Adding SYCL Support to GROMACS. 2021. In: IWOCL'21: Proceedings of the International Workshop on OpenCL IWOCL 2021, Association for Computing Machinery (ACM), 2021. Conference paper (Peer-reviewed)
    Abstract [en]

    GROMACS is an open-source, high-performance molecular dynamics (MD) package primarily used for biomolecular simulations, accounting for 5% of HPC utilization worldwide. Due to the extreme computing needs of MD, significant efforts are invested in improving the performance and scalability of simulations. Target hardware ranges from supercomputers to laptops of individual researchers and volunteers of distributed computing projects such as Folding@Home. The code has been designed both for portability and performance by explicitly adapting algorithms to SIMD and data-parallel processors. A SIMD intrinsic abstraction layer provides high CPU performance. Explicit GPU acceleration has long used CUDA to target NVIDIA devices and OpenCL for AMD/Intel devices. In this talk, we discuss the experiences and challenges of adding support for the SYCL platform into the established GROMACS codebase and share experiences and considerations in porting and optimization. While OpenCL offers the benefits of using the same code to target different hardware, it suffers from several drawbacks that add significant development friction. Its separate-source model leads to code duplication and makes changes complicated. The need to use C99 for kernels, while the rest of the codebase uses C++17, exacerbates these issues. Another problem is that OpenCL, while supported by most GPU vendors, is never the main framework and thus does not get the primary support or tuning efforts. SYCL alleviates many of these issues, employing a single-source model based on the modern C++ standard. In addition to being the primary platform for Intel GPUs, the possibility to target AMD and NVIDIA GPUs through other implementations (e.g., hipSYCL) might make it possible to reduce the number of separate GPU ports that have to be maintained. Some design differences from OpenCL, such as flow directed acyclic graphs (DAGs) instead of in-order queues, made it necessary to reconsider GROMACS's task scheduling approach and architectural choices in the GPU backend. Additionally, supporting multiple GPU platforms presents the challenge of balancing performance (low-level and hardware-specific code) and maintainability (more generalization and code reuse). We will discuss the limitations of the existing codebase and interoperability layers with regard to adding the new platform; the compute performance and latency comparisons; code quality considerations; and the issues we encountered with the SYCL implementations tested. Finally, we will discuss our goals for the next release cycle for the SYCL backend and the overall architecture of GPU acceleration code in GROMACS.
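    As a hedged illustration of the single-source model mentioned above (plain SYCL 2020, not GROMACS code; the device selection and the toy kernel are made up), host and device code share one C++17 translation unit, in contrast to OpenCL's separate C99 kernel sources:

        #include <sycl/sycl.hpp>
        #include <vector>

        int main() {
            std::vector<float> x(1024, 1.0f), f(1024, 0.0f);
            sycl::queue q;  // default device: a GPU if one is available
            {
                sycl::buffer<float> xBuf(x.data(), sycl::range<1>(x.size()));
                sycl::buffer<float> fBuf(f.data(), sycl::range<1>(f.size()));
                q.submit([&](sycl::handler& cgh) {
                    sycl::accessor xs(xBuf, cgh, sycl::read_only);
                    sycl::accessor fs(fBuf, cgh, sycl::write_only);
                    // The lambda below is the device kernel, compiled from the
                    // same source file as the host code; it applies a toy
                    // per-element "force".
                    cgh.parallel_for(sycl::range<1>(x.size()),
                                     [=](sycl::id<1> i) { fs[i] = -2.0f * xs[i]; });
                });
            }  // buffers go out of scope: results are copied back to x and f
            return 0;
        }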

  • 4.
    Alekseenko, Andrey
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Pall, Szilard
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Lindahl, Erik
    KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability. 2024. In: CUG2024 Proceedings, 2024. Conference paper (Peer-reviewed)
    Abstract [en]

    GROMACS is a widely-used molecular dynamics software package with a focus on performance, portability, and maintainability across a broad range of platforms. Thanks to its early algorithmic redesign and flexible heterogeneous parallelization, GROMACS has successfully harnessed GPU accelerators for more than a decade. With the diversification of accelerator platforms in HPC and no obvious choice for a well-suited multi-vendor programming model, the GROMACS project found itself at a crossroads. The performance and portability requirements, as well as a strong preference for a standards-based programming model, motivated our choice to use SYCL for production on both new HPC GPU platforms: AMD and Intel. Since the GROMACS 2022 release, the SYCL backend has been the primary means to target AMD GPUs in preparation for exascale HPC architectures like LUMI and Frontier. SYCL is a cross-platform, royalty-free, C++17-based standard for programming hardware accelerators, from embedded to HPC. It allows using the same code to target GPUs from all three major vendors with minimal specialization, which offers major portability benefits. While SYCL implementations build on native compilers and runtimes, whether such an approach is performant is not immediately evident. Biomolecular simulations have challenging performance characteristics: latency sensitivity, the need for strong scaling, and typical iteration times as short as hundreds of microseconds. Hence, obtaining good performance across the range of problem sizes and scaling regimes is particularly challenging. Here, we share the results of our work on readying GROMACS for AMD GPU platforms using SYCL, and demonstrate performance on Cray EX235a machines with MI250X accelerators. Our findings illustrate that portability is possible without major performance compromises. We provide a detailed analysis of node-level kernel and runtime performance with the aim of sharing best practices with the HPC community on using SYCL as a performance-portable GPU framework.

    Full text (pdf)
  • 5.
    Hess, Berk
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Gong, Jing
    KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för datavetenskap och kommunikation (CSC), Centra, Parallelldatorcentrum, PDC.
    Pall, Szilard
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Schlatter, Philipp
    KTH, Skolan för teknikvetenskap (SCI), Mekanik. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Centra, Linné Flow Center, FLOW.
    Peplinski, Adam
    KTH, Skolan för teknikvetenskap (SCI), Mekanik, Stabilitet, Transition, Kontroll.
    Highly Tuned Small Matrix Multiplications Applied to Spectral Element Code Nek5000. 2016. Conference paper (Peer-reviewed)
  • 6.
    Jansson, Niclas
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Karp, Martin
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Datavetenskap, Beräkningsvetenskap och beräkningsteknik (CST).
    Perez, Adalberto
    KTH, Skolan för teknikvetenskap (SCI), Teknisk mekanik, Strömningsmekanik och Teknisk Akustik.
    Mukha, Timofey
    KTH, Skolan för teknikvetenskap (SCI), Teknisk mekanik, Strömningsmekanik och Teknisk Akustik.
    Ju, Yi
    Max Planck Computing and Data Facility, Garching, Germany.
    Liu, Jiahui
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Datavetenskap, Beräkningsvetenskap och beräkningsteknik (CST).
    Pall, Szilard
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Laure, Erwin
    Max Planck Computing and Data Facility, Garching, Germany.
    Weinkauf, Tino
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Datavetenskap, Beräkningsvetenskap och beräkningsteknik (CST).
    Schumacher, Jörg
    Technische Universität Ilmenau, Ilmenau, Germany.
    Schlatter, Philipp
    Friedrich-Alexander-Universität (FAU) Erlangen-Nürnberg, Germany.
    Markidis, Stefano
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Datavetenskap, Beräkningsvetenskap och beräkningsteknik (CST).
    Exploring the Ultimate Regime of Turbulent Rayleigh–Bénard Convection Through Unprecedented Spectral-Element Simulations. 2023. In: SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Association for Computing Machinery (ACM), 2023, pp. 1-9, article id 5. Conference paper (Peer-reviewed)
    Abstract [en]

    We detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation and in-situ data compression. We carry out initial runs of Rayleigh–Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolve the long-standing question regarding the ultimate regime in RBC.

  • 7. Kutzner, Carsten
    et al.
    Pall, Szilard
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik.
    Fechner, Martin
    Esztermann, Ansgar
    de Groot, Bert L.
    Grubmueller, Helmut
    Best bang for your buck: GPU nodes for GROMACS biomolecular simulations. 2015. In: Journal of Computational Chemistry, ISSN 0192-8651, E-ISSN 1096-987X, Vol. 36, no. 26, pp. 1990-2008. Article in journal (Peer-reviewed)
    Abstract [en]

    The molecular dynamics simulation package GROMACS runs efficiently on a wide variety of hardware from commodity workstations to high performance computing clusters. Hardware features are well-exploited with a combination of single instruction multiple data, multithreading, and message passing interface (MPI)-based single program multiple data/multiple program multiple data parallelism while graphics processing units (GPUs) can be used as accelerators to compute interactions off-loaded from the CPU. Here, we evaluate which hardware produces trajectories with GROMACS 4.6 or 5.0 in the most economical way. We have assembled and benchmarked compute nodes with various CPU/GPU combinations to identify optimal compositions in terms of raw trajectory production rate, performance-to-price ratio, energy efficiency, and several other criteria. Although hardware prices are naturally subject to trends and fluctuations, general tendencies are clearly visible. Adding any type of GPU significantly boosts a node's simulation performance. For inexpensive consumer-class GPUs this improvement equally reflects in the performance-to-price ratio. Although memory issues in consumer-class GPUs could pass unnoticed as these cards do not support error checking and correction memory, unreliable GPUs can be sorted out with memory checking tools. Apart from the obvious determinants for cost-efficiency like hardware expenses and raw performance, the energy consumption of a node is a major cost factor. Over the typical hardware lifetime until replacement of a few years, the costs for electrical power and cooling can become larger than the costs of the hardware itself. Taking that into account, nodes with a well-balanced ratio of CPU and consumer-class GPU resources produce the maximum amount of GROMACS trajectory over their lifetime.
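    A hedged reading of the cost argument above, in our own notation (the symbols are not taken from the paper): the lifetime cost of a node and the amount of trajectory it produces per unit cost can be written as

        \[
          C_\text{total} = C_\text{hardware} + P_\text{node}\, t_\text{life}\, c_\text{energy},
          \qquad
          \text{trajectory per cost} = \frac{R_\text{traj}\, t_\text{life}}{C_\text{total}},
        \]

    where \(P_\text{node}\) is the node's power draw, \(t_\text{life}\) its service lifetime, \(c_\text{energy}\) the price of electrical power including cooling overhead, and \(R_\text{traj}\) the simulation rate (e.g. ns/day). The paper's observation is that over a few years the energy term can exceed \(C_\text{hardware}\), which is why balanced CPU/consumer-GPU nodes maximize trajectory per lifetime cost.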

  • 8.
    Kutzner, Carsten
    et al.
    Max Planck Inst Biophys Chem, Theoret & Computat Biophys, Fassberg 11, D-37077 Gottingen, Germany.
    Páll, Szilard
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Fechner, Martin
    Esztermann, Ansgar
    de Groot, Bert L.
    Grubmüller, Helmut
    More bang for your buck: Improved use of GPU nodes for GROMACS 2018. 2019. In: Journal of Computational Chemistry, ISSN 0192-8651, E-ISSN 1096-987X, Vol. 40, no. 27, pp. 2418-2431. Article in journal (Peer-reviewed)
    Abstract [en]

    We identify hardware that is optimal to produce molecular dynamics (MD) trajectories on Linux compute clusters with the GROMACS 2018 simulation package. Therefore, we benchmark the GROMACS performance on a diverse set of compute nodes and relate it to the costs of the nodes, which may include their lifetime costs for energy and cooling. In agreement with our earlier investigation using GROMACS 4.6 on hardware of 2014, the performance to price ratio of consumer GPU nodes is considerably higher than that of CPU nodes. However, with GROMACS 2018, the optimal CPU to GPU processing power balance has shifted even more toward the GPU. Hence, nodes optimized for GROMACS 2018 and later versions enable a significantly higher performance to price ratio than nodes optimized for older GROMACS versions. Moreover, the shift toward GPU processing allows to cheaply upgrade old nodes with recent GPUs, yielding essentially the same performance as comparable brand-new hardware.

  • 9.
    Pall, Szilard
    et al.
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Schulz, Roland
    Intel Corp, Santa Clara, CA USA..
    Advances in the OpenCL offload support in GROMACS. 2019. In: Proceedings of the International Workshop on OpenCL (IWOCL'19), Association for Computing Machinery (ACM), 2019. Conference paper (Peer-reviewed)
    Abstract [en]

    GROMACS is a molecular dynamics (MD) simulation package widely used in research and education on machines ranging from laptops to workstations to the largest supercomputers. Built on a highly portable, free and open-source codebase, GROMACS is known to have among the fastest simulation engines thanks to highly tuned kernels for more than a dozen processor architectures. For CPU architectures it relies on SIMD intrinsics-based code, while for GPUs, besides the dominant CUDA platform, OpenCL is also supported on NVIDIA, AMD and Intel GPUs and is actively developed. This talk aims to present the recent advances in improved offload capabilities and broader platform support of the GROMACS OpenCL codebase.

    With a long history of CUDA support, and in an effort to maintain portability to platforms other than the dominant accelerator platform, an OpenCL port was developed four years ago and has been used successfully, predominantly on AMD GPUs. Despite the modest user base, recent efforts have focused on achieving feature parity with the CUDA codebase. The offload of additional computation (the particle mesh Ewald solver) aims to compensate for the shift in the performance advantage of GPUs and the resulting runtime imbalance, as well as to better support dense accelerator nodes. Performance improvements of up to 1.5x can be seen on workstations equipped with AMD Vega GPUs.

    Additionally, platform support has been expanded to Intel iGPUs. Tweaks to the underlying pair-interaction algorithm setup were necessary to reach good performance. We observe a 5-25% performance benefit in an asynchronous offload scenario running concurrently on both the CPU cores and the iGPU, compared to only using the highly tuned SIMD intrinsics code on the CPU cores. By leaving the iGPU a larger fraction of the limited power budget of a mobile processor, application performance improved, which suggests that a configurable TDP allocation to match the computational load with the hardware balance would be beneficial. Such results will become especially useful as most future high-performance processor architectures will increase integration and will feature on-chip heterogeneity, with different components more or less well suited for different parts of an HPC application.

  • 10.
    Pall, Szilard
    et al.
    KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC.
    Zhmurov, Artem
    KTH, Skolan för elektroteknik och datavetenskap (EECS), Centra, Parallelldatorcentrum, PDC. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Bauer, Paul
    KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Abraham, Mark James
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Lundborg, Magnus
    ERCO Pharma AB, Stockholm, Sweden..
    Gray, Alan
    NVIDIA Corp, Reading, Berks, England..
    Hess, Berk
    KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik.
    Lindahl, Erik
    KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Centra, Science for Life Laboratory, SciLifeLab. KTH, Skolan för teknikvetenskap (SCI), Tillämpad fysik, Biofysik. Stockholm Univ, Dept Biochem & Biophys, Sci Life Lab, Box 1031, S-17121 Solna, Sweden..
    Heterogeneous parallelization and acceleration of molecular dynamics simulations in GROMACS. 2020. In: Journal of Chemical Physics, ISSN 0021-9606, E-ISSN 1089-7690, Vol. 153, no. 13, article id 134110. Article in journal (Peer-reviewed)
    Abstract [en]

    The introduction of accelerator devices such as graphics processing units (GPUs) has had profound impact on molecular dynamics simulations and has enabled order-of-magnitude performance advances using commodity hardware. To fully reap these benefits, it has been necessary to reformulate some of the most fundamental algorithms, including the Verlet list, pair searching, and cutoffs. Here, we present the heterogeneous parallelization and acceleration design of molecular dynamics implemented in the GROMACS codebase over the last decade. The setup involves a general cluster-based approach to pair lists and non-bonded pair interactions that utilizes both GPU and central processing unit (CPU) single instruction, multiple data acceleration efficiently, including the ability to load-balance tasks between CPUs and GPUs. The algorithm work efficiency is tuned for each type of hardware, and to use accelerators more efficiently, we introduce dual pair lists with rolling pruning updates. Combined with new direct GPU-GPU communication and GPU integration, this enables excellent performance from single GPU simulations through strong scaling across multiple GPUs and efficient multi-node parallelization.
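    A minimal sketch of the dual pair-list idea mentioned above, under our own assumptions (illustrative C++, not the GROMACS data structures): a pair list built with an outer cut-off larger than the interaction cut-off is reused for many steps, and is periodically pruned against the inner cut-off so that the force kernels only traverse pairs that can actually interact.

        #include <utility>
        #include <vector>

        struct Pos { float x, y, z; };

        struct DualPairList {
            std::vector<std::pair<int, int>> outer;  // built rarely, larger cut-off
            std::vector<std::pair<int, int>> inner;  // pruned often, interaction cut-off
        };

        inline float dist2(const Pos& a, const Pos& b) {
            const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
            return dx * dx + dy * dy + dz * dz;
        }

        // "Rolling" prune: refresh the inner list from the outer one; in the
        // scheme described above this work is spread over successive steps.
        void prune(const std::vector<Pos>& r, DualPairList& lists, float innerCutoff) {
            const float cut2 = innerCutoff * innerCutoff;
            lists.inner.clear();
            for (const auto& [i, j] : lists.outer) {
                if (dist2(r[i], r[j]) < cut2) {
                    lists.inner.emplace_back(i, j);
                }
            }
        }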

  • 11.
    Pronk, Sander
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Pall, Szilard
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Schulz, Roland
    Larsson, Per
    Bjelkmar, Pär
    Apostolov, Rossen
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Shirts, Michael R.
    Smith, Jeremy C.
    Kasson, Peter M.
    van der Spoel, David
    Hess, Berk
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Lindahl, Erik
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. 2013. In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 29, no. 7, pp. 845-854. Article in journal (Peer-reviewed)
    Abstract [en]

    Motivation: Molecular simulation has historically been a low-throughput technique, but faster computers and increasing amounts of genomic and structural data are changing this by enabling large-scale automated simulation of, for instance, many conformers or mutants of biomolecules with or without a range of ligands. At the same time, advances in performance and scaling now make it possible to model complex biomolecular interaction and function in a manner directly testable by experiment. These applications share a need for fast and efficient software that can be deployed on massive scale in clusters, web servers, distributed computing or cloud resources. Results: Here, we present a range of new simulation algorithms and features developed during the past 4 years, leading up to the GROMACS 4.5 software package. The software now automatically handles wide classes of biomolecules, such as proteins, nucleic acids and lipids, and comes with all commonly used force fields for these molecules built-in. GROMACS supports several implicit solvent models, as well as new free-energy algorithms, and the software now uses multithreading for efficient parallelization even on low-end systems, including windows-based workstations. Together with hand-tuned assembly kernels and state-of-the-art parallelization, this provides extremely high performance and cost efficiency for high-throughput as well as massively parallel simulations.

    Full text (pdf)
  • 12.
    Páll, Szilard
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Abraham, Mark James
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Kutzner, Carsten
    Hess, Berk
    Lindahl, Erik
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, Science for Life Laboratory, SciLifeLab. Department of Biochemistry & Biophysics, Center for Biomembrane Research, Stockholm University, Stockholm, Sweden .
    Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS. 2015. In: Solving Software Challenges for Exascale, Springer Publishing Company, 2015, pp. 3-27. Conference paper (Peer-reviewed)
    Abstract [en]

    GROMACS is a widely used package for biomolecular simulation, and over the last two decades it has evolved from small-scale efficiency to advanced heterogeneous acceleration and multi-level parallelism targeting some of the largest supercomputers in the world. Here, we describe some of the ways we have been able to realize this through the use of parallelization on all levels, combined with a constant focus on absolute performance. Release 4.6 of GROMACS uses SIMD acceleration on a wide range of architectures, GPU offloading acceleration, and both OpenMP and MPI parallelism within and between nodes, respectively. The recent work on acceleration made it necessary to revisit the fundamental algorithms of molecular simulation, including the concept of neighbor searching, and we discuss the present and future challenges we see for exascale simulation - in particular a very fine-grained task parallelism. We also discuss the software management, code peer review and continuous integration testing required for a project of this complexity.

    Full text (pdf)
  • 13.
    Páll, Szilard
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    Hess, Berk
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik. KTH, Centra, SeRC - Swedish e-Science Research Centre. KTH, Centra, Science for Life Laboratory, SciLifeLab.
    A flexible algorithm for calculating pair interactions on SIMD architectures. 2013. In: Computer Physics Communications, ISSN 0010-4655, E-ISSN 1879-2944, Vol. 184, no. 12, pp. 2641-2650. Article in journal (Peer-reviewed)
    Abstract [en]

    Calculating interactions or correlations between pairs of particles is typically the most time-consuming task in particle simulation or correlation analysis. Straightforward implementations using a double loop over particle pairs have traditionally worked well, especially since compilers usually do a good job of unrolling the inner loop. In order to reach high performance on modern CPU and accelerator architectures, single-instruction multiple-data (SIMD) parallelization has become essential. Avoiding memory bottlenecks is also increasingly important and requires reducing the ratio of memory to arithmetic operations. Moreover, when pairs only interact within a certain cut-off distance, good SIMD utilization can only be achieved by reordering input and output data, which quickly becomes a limiting factor. Here we present an algorithm for SIMD parallelization based on grouping a fixed number of particles, e.g. 2, 4, or 8, into spatial clusters. Calculating all interactions between particles in a pair of such clusters improves data reuse compared to the traditional scheme and results in a more efficient SIMD parallelization. Adjusting the cluster size allows the algorithm to map to SIMD units of various widths. This flexibility not only enables fast and efficient implementation on current CPUs and accelerator architectures like GPUs or Intel MIC, but it also makes the algorithm future-proof. We present the algorithm with an application to molecular dynamics simulations, where we can also make use of the effective buffering the method introduces.
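    A hedged sketch of the cluster-pair scheme described above (illustrative C++ with a made-up interaction, not the actual GROMACS kernels): particles are grouped into clusters of a fixed size matched to the SIMD width, and all interactions between the particles of a cluster pair are computed, so the inner loop maps onto SIMD lanes and loaded data is reused.

        constexpr int kClusterSize = 4;  // e.g. 2, 4 or 8, chosen to match the SIMD width

        // Structure-of-arrays layout: each coordinate component of a cluster is
        // contiguous, so a SIMD register can load a whole cluster at once.
        struct Cluster {
            float x[kClusterSize];
            float y[kClusterSize];
            float z[kClusterSize];
        };

        // Accumulate toy forces on cluster a from all particles of cluster b.
        // A vectorizing compiler (or explicit intrinsics) can map the j loop
        // onto SIMD lanes; the data of cluster b is reused for every i.
        void clusterPairForces(const Cluster& a, const Cluster& b,
                               float fx[], float fy[], float fz[]) {
            for (int i = 0; i < kClusterSize; ++i) {
                for (int j = 0; j < kClusterSize; ++j) {
                    const float dx = a.x[i] - b.x[j];
                    const float dy = a.y[i] - b.y[j];
                    const float dz = a.z[i] - b.z[j];
                    const float r2 = dx * dx + dy * dy + dz * dz + 1e-6f;
                    const float s  = 1.0f / (r2 * r2);  // made-up pair interaction
                    fx[i] += s * dx;
                    fy[i] += s * dy;
                    fz[i] += s * dz;
                }
            }
        }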

  • 14.
    Páll, Szilárd
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik.
    Hess, Berk
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik.
    Lindahl, Erik
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik, Beräkningsbiofysik.
    Poster - 3D Tixels: A highly efficient algorithm for GPU/CPU-acceleration of molecular dynamics on heterogeneous parallel architectures. 2011. In: SC - Proc. High Perform. Comput. Networking, Storage Anal. Companion, Co-located SC, 2011, pp. 71-72. Conference paper (Peer-reviewed)
    Abstract [en]

    Several GPU-based algorithms have been developed to accelerate biomolecular simulations, but although they provide benefits over single-core implementations, they have not been able to surpass the performance of state-of-the-art SIMD CPU implementations (e.g. GROMACS), not to mention efficient scaling. Here, we present a heterogeneous parallelization that utilizes both CPU and GPU resources efficiently. A novel fixed-particle-number sub-cell algorithm for non-bonded force calculation was developed. The algorithm uses the SIMD width as algorithmic work unit; it is intrinsically future-proof since it can be adapted to future hardware. The CUDA non-bonded kernel implementation achieves up to 60% work-efficiency, 1.5 IPC, and 95% L1 cache utilization. On the CPU, OpenMP-parallelized SSE-accelerated code runs overlapping with GPU execution. Fully automated dynamic inter-process as well as CPU-GPU load balancing is employed. We achieve threefold speedup compared to equivalent GROMACS CPU code and show good strong and weak scaling. To the best of our knowledge this is the fastest GPU molecular dynamics implementation presented to date.

  • 15.
    Wennberg, Christian L.
    et al.
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Murtola, Teemu
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Pall, Szilard
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Abraham, Mark James
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Hess, Berk
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Lindahl, Erik
    KTH, Skolan för teknikvetenskap (SCI), Teoretisk fysik. KTH, Centra, SeRC - Swedish e-Science Research Centre.
    Direct-Space Corrections Enable Fast and Accurate Lorentz-Berthelot Combination Rule Lennard-Jones Lattice Summation. 2015. In: Journal of Chemical Theory and Computation, ISSN 1549-9618, E-ISSN 1549-9626, Vol. 11, no. 12, pp. 5737-5746. Article in journal (Peer-reviewed)
    Abstract [en]

    Long-range lattice summation techniques such as the particle-mesh Ewald (PME) algorithm for electrostatics have been revolutionary to the precision and accuracy of molecular simulations in general. Despite the performance penalty associated with lattice summation electrostatics, few biomolecular simulations today are performed without it. There are increasingly strong arguments for moving in the same direction for Lennard-Jones (LJ) interactions, and by using geometric approximations of the combination rules in reciprocal space, we have been able to make a very high-performance implementation available in GROMACS. Here, we present a new way to correct for these approximations to achieve exact treatment of Lorentz-Berthelot combination rules within the cutoff, and only a very small approximation error remains outside the cutoff (a part that would be completely ignored without LJ-PME). This not only improves accuracy by almost an order of magnitude but also achieves absolute biomolecular simulation performance that is an order of magnitude faster than any other available lattice summation technique for LJ interactions. The implementation includes both CPU and GPU acceleration, and its combination with improved scaling LJ-PME simulations now provides performance close to the truncated potential methods in GROMACS but with much higher accuracy.
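    For context (standard textbook definitions, not taken from the paper): the Lorentz-Berthelot combination rules mentioned above combine per-atom Lennard-Jones parameters as

        \[
          \sigma_{ij} = \tfrac{1}{2}\left(\sigma_i + \sigma_j\right),
          \qquad
          \varepsilon_{ij} = \sqrt{\varepsilon_i \varepsilon_j},
        \]

    whereas the reciprocal-space (lattice-summation) part uses the geometric approximation \(\sigma_{ij} = \sqrt{\sigma_i \sigma_j}\). The direct-space correction described in the abstract restores the exact Lorentz-Berthelot values inside the cut-off, leaving only a small approximation error beyond it.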

    Full text (pdf)