Publications (10 of 81)
Ohm, P., Harper, G. & Jansson, N. (2026). A Matrix-Free Algebraic hp-Multigrid Method for Computational Fluid Dynamics Applications. In: Proceedings of Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026. Paper presented at Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026, Osaka, Japan, January 26-29, 2026 (pp. 194-202). Association for Computing Machinery (ACM)
2026 (English). In: Proceedings of Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026, Association for Computing Machinery (ACM), 2026, p. 194-202. Conference paper, Published paper (Refereed)
Abstract [en]

We present an algebraic hp-multigrid method for high-order matrix-free methods. Algebraic multigrid methods often require information about matrix entries, which are not available in a matrix-free setting; however, when rediscretization for geometric multigrid is not available for a matrix-free method, coarsening must be constructed using information from the mesh. Leveraging only mesh adjacency information, this algorithm constructs an algebraic multigrid hierarchy without requiring geometric coarsening or explicit matrix assembly, making it well-suited for GPU-accelerated architectures. This paper presents the implementation of the matrix-free method in the high-fidelity computational fluid dynamics framework Neko, which utilizes spectral element methods with an implicit-explicit scheme to solve the incompressible Navier-Stokes equations. We utilize an hp-multigrid approach, where the problem is first coarsened from high-order polynomials to low-order polynomials, and then the low-order system is further coarsened spatially in a matrix-free fashion using mesh adjacency information. Finally, we present numerical results from the Dardel and LUMI supercomputers that demonstrate the performance and scalability of our method as well as its applicability to real-world applications.
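
As a rough illustration of coarsening from mesh adjacency alone, the following minimal sketch builds a hierarchy by greedily grouping each element with its not-yet-grouped neighbours. It is not the algorithm from the paper: the greedy rule, the function names `aggregate` and `coarse_adjacency`, and the 1D chain mesh are all assumptions chosen for brevity.

```python
def aggregate(adjacency):
    """Greedily group each unassigned element with its unassigned neighbours.

    `adjacency` maps an element id to an iterable of neighbouring element ids.
    Returns a dict mapping each element id to an aggregate id.
    """
    agg_of = {}
    next_id = 0
    for elem in sorted(adjacency):
        if elem in agg_of:
            continue
        agg_of[elem] = next_id
        for nbr in adjacency[elem]:
            if nbr not in agg_of:
                agg_of[nbr] = next_id
        next_id += 1
    return agg_of


def coarse_adjacency(adjacency, agg_of):
    """Two aggregates are adjacent if any of their member elements are."""
    coarse = {a: set() for a in set(agg_of.values())}
    for elem, nbrs in adjacency.items():
        for nbr in nbrs:
            a, b = agg_of[elem], agg_of[nbr]
            if a != b:
                coarse[a].add(b)
                coarse[b].add(a)
    return coarse


# Hypothetical mesh: a 1D chain of six elements, 0-1-2-3-4-5.
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

levels = [chain]
while len(levels[-1]) > 1:
    agg = aggregate(levels[-1])
    coarser = coarse_adjacency(levels[-1], agg)
    if len(coarser) == len(levels[-1]):
        break  # no further reduction possible (e.g. disconnected elements)
    levels.append(coarser)

level_sizes = [len(level) for level in levels]
```

No matrix entries are read at any point; only the neighbour lists drive the grouping, which is the property the abstract emphasises.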

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2026
Keywords
Algebraic multigrid, hp-multigrid, Matrix-free, p-multigrid, preconditioning
National Category
Computational Mathematics Computer Sciences
Identifiers
urn:nbn:se:kth:diva-378882 (URN); 10.1145/3773656.3773686 (DOI); 2-s2.0-105031770200 (Scopus ID)
Conference
Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026, Osaka, Japan, January 26-29, 2026
Note

Part of ISBN 9798400720673

QC 20260415

Available from: 2026-04-09 Created: 2026-04-09 Last updated: 2026-04-15. Bibliographically approved
Du, S., Münsch, M., Jansson, N. & Schlatter, P. (2026). Assessment of the gradient jump penalisation in large-eddy simulations of turbulence. Acta Mechanica
2026 (English). In: Acta Mechanica, ISSN 0001-5970, E-ISSN 1619-6937. Article in journal (Refereed). Epub ahead of print
Abstract [en]

This research investigates the efficacy of the gradient jump penalisation (GJP) in large-eddy simulations (LES) when coupled with active subgrid-scale models. GJP is a stabilisation method tailored for the continuous Galerkin spectral element method, aiming at mitigating non-physical oscillations induced by discontinuous velocity gradients across element interfaces. We demonstrate that GJP effectively smoothens fields from LES without a salient impact on flow dynamics for the Taylor–Green vortex (TGV) at Re = 1600, periodic hill flows at bulk Reynolds numbers Re_b = 10,595 and 37,000, as well as turbulent channel flow at Re_τ ≈ 550. In the TGV case, the application of GJP results in decreased fluctuations only at high wavenumbers compared to simulations without GJP. The periodic hill flow simulations indicate the applicability of GJP in wall-resolved LES involving curved geometries, though it tends to dissipate some of the finer details in the solution. Finally, in the analysis of the canonical turbulent channel flow cases, GJP leads to a higher resolved turbulent kinetic energy than simulations without GJP and direct numerical simulations. GJP’s mechanism is identified as providing enhanced dissipation at high wavenumbers but accompanied by insufficient dissipation at low wavenumbers, leading to a pronounced spectral cut-off. Non-physical oscillations on element interfaces are reflected as spikes in the power spectral density. By evaluating the sharpness of the strongest spike, GJP is shown to smoothen the spectra without, however, completely removing the gradient jumps at low computational resolution.

Place, publisher, year, edition, pages
Springer Nature, 2026
National Category
Fluid Mechanics Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-375361 (URN); 10.1007/s00707-025-04607-z (DOI); 001654939600001 (); 2-s2.0-105026775830 (Scopus ID)
Funder
Swedish e‐Science Research Center, M3; EU, Horizon Europe, 101093393
Note

QC 20260114

Available from: 2026-01-13 Created: 2026-01-13 Last updated: 2026-01-14. Bibliographically approved
Karp, M., Stanly, R., Mukha, T., Galimberti, L., Toosi, S., Song, H., . . . Schlatter, P. (2026). Effects of lower floating-point precision on scale-resolving numerical simulations of turbulence. Journal of Computational Physics, 549, Article ID 114600.
2026 (English). In: Journal of Computational Physics, ISSN 0021-9991, E-ISSN 1090-2716, Vol. 549, article id 114600. Article in journal (Refereed). Published
Abstract [en]

Modern computing clusters offer specialized hardware for reduced-precision arithmetic, which can significantly speed up the time to solution. This is possible due to a decrease in data movement, as well as the ability to perform arithmetic operations at a faster rate. However, for high-fidelity simulations of turbulence, such as direct and large-eddy simulation, the impact of reduced precision on the computed solution and the resulting uncertainty across flow solvers and different flow cases has not been explored in detail, and limits the optimal utilization of new high-performance computing systems. In this work, the effect of reduced precision is studied using four diverse computational fluid dynamics (CFD) solvers (two incompressible, Neko and Simson, and two compressible, PadeLibs and SSDC) using four test cases: turbulent channel flow at Re_τ = 550 and higher, forced transition in a channel, flow over a cylinder at Re_D = 3900, and compressible flow over a wing section at Re_c = 50000. We observe that the flow physics are remarkably robust with respect to reductions in floating-point precision, and that other forms of uncertainty, due to, for example, time averaging, often have a much larger impact on the computed result. Our results indicate that different terms in the Navier–Stokes equations can be computed to a lower floating-point accuracy without affecting the results. In particular, standard IEEE single precision can be used effectively for the entirety of the simulation, showing no significant discrepancies from double-precision results across the solvers and cases considered. Potential pitfalls are also discussed.

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Computational fluid dynamics, Direct numerical simulation, Floating-point precision, Turbulence
National Category
Fluid Mechanics Computational Mathematics Computer Sciences
Identifiers
urn:nbn:se:kth:diva-375324 (URN); 10.1016/j.jcp.2025.114600 (DOI); 001654296600002 (); 2-s2.0-105025717580 (Scopus ID)
Note

Not duplicate with DiVA 2002138

QC 20260112

Available from: 2026-01-12 Created: 2026-01-12 Last updated: 2026-01-12. Bibliographically approved
Chen, Y., de Oliveira Castro, P., Bientinesi, P., Jansson, N. & Iakymchuk, R. (2026). Enabling mixed-precision in spectral element codes. Future Generation Computer Systems, 174, Article ID 107990.
2026 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 174, article id 107990. Article in journal (Refereed). Published
Abstract [en]

Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, the roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone (Nek5000 developers), a mini-application for the Computational Fluid Dynamics (CFD) solver Nek5000 (Fischer et al.), and a modern Neko (Jansson et al., 2024) CFD application. With the help of the Verificarlo (Denis et al., 2016) tool and computer arithmetic techniques, we introduce a strategy to address stagnation issues in the preconditioned Conjugate Gradient method in Nekbone and apply these insights to implement a mixed-precision version of Neko. We evaluate the derived mixed-precision versions of these codes by combining metrics in three dimensions: accuracy, time-to-solution, and energy-to-solution. Notably, mixed-precision in Nekbone reduces time-to-solution by roughly 1.62x and energy-to-solution by 2.43x on MareNostrum 5, while in the real-world Neko application, the gain is up to 1.3x in both time and energy, with accuracy that matches double-precision results.
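
The roofline model mentioned in the abstract bounds attainable performance by the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below applies that standard bound to a vector update y = a*x + y to show why halving the bytes per value can help a memory-bound kernel. The kernel choice, the function name `attainable_gflops`, and the peak and bandwidth figures are illustrative assumptions, not values from the article.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    """Roofline bound: min(peak, bandwidth * arithmetic intensity)."""
    intensity = flops / bytes_moved  # flop per byte
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine, same compute peak assumed for both precisions
# so that only the effect of data movement is visible.
PEAK_GFLOPS = 1000.0
BANDWIDTH_GBS = 100.0

# y = a*x + y per entry: 2 flops, 3 values moved (read x, read y, write y).
FLOPS_PER_ENTRY = 2
VALUES_PER_ENTRY = 3

double_bound = attainable_gflops(
    PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_ENTRY, VALUES_PER_ENTRY * 8
)
single_bound = attainable_gflops(
    PEAK_GFLOPS, BANDWIDTH_GBS, FLOPS_PER_ENTRY, VALUES_PER_ENTRY * 4
)
speedup_bound = single_bound / double_bound
```

Under these assumptions both variants sit on the bandwidth-limited part of the roofline, so the bound doubles in single precision; a compute-bound kernel would show no such gain from reduced data movement alone.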

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Computer arithmetic tool, Conjugate gradient, Energy-to-solution, Mixed-precision, Neko, Roofline model, Verificarlo
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-368935 (URN); 10.1016/j.future.2025.107990 (DOI); 001528005900003 (); 2-s2.0-105009726439 (Scopus ID)
Note

QC 20250828

Available from: 2025-08-28 Created: 2025-08-28 Last updated: 2025-11-13. Bibliographically approved
Chien, S. W. .., Sato, K., Podobas, A., Jansson, N., Markidis, S. & Honda, M. (2026). ParaLog: Consistent Host-side Logging for Parallel Checkpoints. In: SoCC 2025 - Proceedings of the 2025 ACM Symposium on Cloud Computing. Paper presented at 2025 ACM Symposium on Cloud Computing, SoCC 2025, Virtual, Online, United States of America, November 19-21, 2025 (pp. 59-73). Association for Computing Machinery, Inc
2026 (English). In: SoCC 2025 - Proceedings of the 2025 ACM Symposium on Cloud Computing, Association for Computing Machinery, Inc, 2026, p. 59-73. Conference paper, Published paper (Refereed)
Abstract [en]

Output-intensive scientific applications are highly sensitive to low storage throughput. While existing scientific application stacks are optimized for traditional High-Performance Computing (HPC) environments with high remote storage and network bandwidth, these assumptions often fail in modern settings like cloud deployment. This is because the existing scientific application I/O stack fails to leverage the available resources. At the same time, scientific applications exhibit special synchronization and data output requirements that are difficult to satisfy using traditional approaches such as block-level or filesystem-level caching. We introduce ParaLog, a distributed host-side logging approach designed to accelerate scientific applications transparently. ParaLog emphasizes deployability, enabling support for unmodified message passing interface (MPI) applications and implementations while preserving crash consistency semantics. We evaluate ParaLog across traditional HPC, cloud HPC, local clusters, and hybrid environments, demonstrating its capability to reduce end-to-end execution time by 13-26% for popular scientific applications in cloud settings.

Place, publisher, year, edition, pages
Association for Computing Machinery, Inc, 2026
Keywords
burst buffer, caching, Cloud Computing, High Performance Computing, parallel IO, S3, scientific applications
National Category
Computer Sciences Computer Systems
Identifiers
urn:nbn:se:kth:diva-376725 (URN); 10.1145/3772052.3772212 (DOI); 001697656400005 (); 2-s2.0-105028598983 (Scopus ID)
Conference
2025 ACM Symposium on Cloud Computing, SoCC 2025, Virtual, Online, United States of America, November 19-21, 2025
Note

Part of ISBN 9798400722769

QC 20260213

Available from: 2026-02-13 Created: 2026-02-13 Last updated: 2026-05-29. Bibliographically approved
Jansson, N., Karp, M., Páll, S., Markidis, S. & Schlatter, P. (2026). Task-decomposed Overlapped Preconditioner for Sustained Strong Scalability on Accelerated Exascale Systems. In: Proceedings of Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026. Paper presented at Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026, Osaka, Japan, January 26-29, 2026 (pp. 186-193). Association for Computing Machinery (ACM)
2026 (English). In: Proceedings of Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026, Association for Computing Machinery (ACM), 2026, p. 186-193. Conference paper, Published paper (Refereed)
Abstract [en]

We detail our work on improving the performance and scalability of key numerical methods in the high-fidelity spectral element code Neko on accelerated exascale machines. Efficient preconditioners are essential in incompressible fluid dynamics; however, the most efficient method (with respect to convergence) might be challenging to implement with good performance on an accelerator. We present our development of a GPU-optimised preconditioner with task overlapping for the pressure-Poisson equation, improving the preconditioner's throughput (in TDoF/s) by close to 60%. The new preconditioner is explained in detail, together with detailed performance studies on accelerated Cray EX platforms, including strong scalability studies on LUMI and Frontier.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2026
Keywords
Accelerators, Direct numerical simulation, Spectral element method
National Category
Computational Mathematics Computer Sciences Fluid Mechanics
Identifiers
urn:nbn:se:kth:diva-378758 (URN); 10.1145/3773656.3773690 (DOI); 2-s2.0-105031765001 (Scopus ID)
Conference
Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, SCA/HPCAsia 2026, Osaka, Japan, January 26-29, 2026
Note

Part of ISBN 9798400720673

QC 20260330

Available from: 2026-03-30 Created: 2026-03-30 Last updated: 2026-03-30. Bibliographically approved
Jansson, N., Karp, M., Wahlgren, J., Markidis, S. & Schlatter, P. (2025). Design of Neko—A Scalable High‐Fidelity Simulation Framework With Extensive Accelerator Support. Concurrency and Computation, 37(2), Article ID e8340.
2025 (English). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 37, no 2, article id e8340. Article in journal (Refereed). Published
Abstract [en]

Recent trends and advancements in including more diverse and heterogeneous hardware in High-Performance Computing (HPC) are challenging scientific software developers in their pursuit of efficient numerical methods with sustained performance across a diverse set of platforms. As a result, researchers are today forced to re-factor their codes to leverage these powerful new heterogeneous systems. We present our design considerations of Neko—a portable framework for high-fidelity spectral element flow simulations. Unlike prior works, Neko adopts a modern object-oriented Fortran 2008 approach, allowing multi-tier abstractions of the solver stack and facilitating various hardware backends ranging from general-purpose processors, accelerators down to exotic vector processors and Field-Programmable Gate Arrays (FPGAs). Focusing on the performance and portability of Neko, we describe the framework's device abstraction layer managing device memory, data transfer and kernel launches from Fortran, allowing for a solver written in a hardware-neutral yet performant way. Accelerator-specific optimizations are also discussed, with auto-tuning of key kernels and various communication strategies using device-aware MPI. Finally, we present performance measurements on a wide range of computing platforms, including the EuroHPC pre-exascale system LUMI, where Neko achieves excellent parallel efficiency for a large direct numerical simulation (DNS) of turbulent fluid flow using up to 80% of the entire LUMI supercomputer.

Place, publisher, year, edition, pages
Wiley, 2025
National Category
Computational Mathematics Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358042 (URN); 10.1002/cpe.8340 (DOI); 001387473600001 (); 2-s2.0-85213688601 (Scopus ID)
Funder
Swedish Research Council, 2019‐04723; Swedish e‐Science Research Center, SESSI; EU, Horizon Europe, 101093393
Note

QC 20250122

Available from: 2025-01-03 Created: 2025-01-03 Last updated: 2025-01-22. Bibliographically approved
Stanly, R., Bagheri, E., Peplinski, A., Toosi, S., Jansson, N., Mukha, T. & Schlatter, P. (2025). Direct numerical simulation of a starting rotor at Re_c = 15000. Journal of Visualization, 28(6), 1083-1090
2025 (English). In: Journal of Visualization, ISSN 1343-8875, E-ISSN 1875-8975, Vol. 28, no 6, p. 1083-1090. Article in journal (Refereed). Published
Abstract [en]

Rotors play a major role in various applications including ventilation and propulsion systems such as in helicopters, drones, gas turbines and wind turbines. This visualization of instantaneous vortical structures (identified by the λ2 criterion) shows complex flow structures emanating from a twisted drone rotor that is impulsively starting to rotate at 1600 rpm. Initially, a starting vortex is formed as a result of lift generation and shed as a connected vortex tube from the entire surface of the blade, which has a strong connection to the blade tip via the so-called tip vortex. Leading edge separation occurs at span positions of high twist, followed by wave-induced breakdown to turbulence along the whole wing span. This turbulence then sheds as small-scale vortices into the wake and dissipates. Understanding the behaviour of these vortices from such complex blades and how they interact with the other blade is critical to designing more efficient and potentially more silent propellers.

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Drone rotor, Adaptive mesh refinement, Spectral element method, Leading edge vortex, Tip vortex, Propeller
National Category
Fluid Mechanics
Identifiers
urn:nbn:se:kth:diva-371692 (URN); 10.1007/s12650-025-01085-2 (DOI); 001590961400001 (); 2-s2.0-105018690007 (Scopus ID)
Funder
KTH Royal Institute of Technology
Note

QC 20260123

Available from: 2025-10-16 Created: 2025-10-16 Last updated: 2026-01-23. Bibliographically approved
Karp, M., Suarez, E., Meinke, J. H., Andersson, M. I., Schlatter, P., Markidis, S. & Jansson, N. (2025). Experience and analysis of scalable high-fidelity computational fluid dynamics on modular supercomputing architectures. The international journal of high performance computing applications, 39(3), 329-344
2025 (English). In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 39, no 3, p. 329-344. Article in journal (Refereed). Published
Abstract [en]

The never-ending computational demand from simulations of turbulence makes computational fluid dynamics (CFD) a prime application use case for current and future exascale systems. High-order finite element methods, such as the spectral element method, have been gaining traction as they offer high performance on both multicore CPUs and modern GPU-based accelerators. In this work, we assess how high-fidelity CFD using the spectral element method can exploit the modular supercomputing architecture at scale through domain partitioning, where the computational domain is split between a Booster module powered by GPUs and a Cluster module with conventional CPU nodes. We investigate several different flow cases and computer systems based on the Modular Supercomputing Architecture (MSA). We observe that for our simulations, the communication overhead and load balancing issues incurred by incorporating different computing architectures are seldom worthwhile, especially when I/O is also considered, but when the simulation at hand requires more than the combined global memory on the GPUs, utilizing additional CPUs to increase the available memory can be fruitful. We support our results with a simple performance model to assess when running across modules might be beneficial. As MSA is becoming more widespread and efforts to increase system utilization are growing more important, our results give insight into when and how a monolithic application can utilize and spread out to more than one module and obtain a faster time to solution.

Place, publisher, year, edition, pages
SAGE Publications, 2025
National Category
Computer Sciences Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-358044 (URN); 10.1177/10943420241303163 (DOI); 001366656300001 (); 2-s2.0-105003765421 (Scopus ID)
Funder
Swedish Research Council, 2019-04723; Swedish e‐Science Research Center, SESSI; EU, Horizon 2020, 955606
Note

QC 20260123

Available from: 2025-01-03 Created: 2025-01-03 Last updated: 2026-01-23. Bibliographically approved
Andersson, M., Karp, M., Jansson, N. & Markidis, S. (2025). Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe. In: Proceedings 31st European Conference on Parallel and Distributed Processing: Heteropar 2025, 23rd International Workshop. Paper presented at 31st European Conference on Parallel and Distributed Processing, Dresden, Germany, August 25–29, 2025. Springer
2025 (English). In: Proceedings 31st European Conference on Parallel and Distributed Processing: Heteropar 2025, 23rd International Workshop, Springer, 2025. Conference paper, Published paper (Refereed)
Abstract [en]

With the emergence of new high-performance computing (HPC) accelerators, such as Nvidia and AMD GPUs, efficiently targeting diverse hardware architectures has become a major challenge for HPC application developers. The increasing hardware diversity in HPC systems often necessitates the development of architecture-specific code, hindering the sustainability of large-scale scientific applications. In this work, we leverage DaCe, a data-centric parallel programming framework, to automate the generation of high-performance kernels. DaCe enables automatic code generation for multicore processors and various accelerators, reducing the burden on developers who would otherwise need to rewrite code for each new architecture. Our study demonstrates DaCe's capabilities by applying its automatic code generation to a critical computational kernel used in Computational Fluid Dynamics (CFD). Specifically, we focus on Neko, a Fortran-based solver that employs the spectral-element method, which relies on small tensor operations. We detail the formulation of this computational kernel using DaCe's Stateful Dataflow Multigraph (SDFG) representation and discuss how this approach facilitates high-performance code generation. Additionally, we outline the workflow for seamlessly integrating DaCe's generated code into the Neko solver. Our results highlight the portability and performance of the generated code across multiple platforms, including Nvidia GH200, Nvidia A100, and AMD MI250X GPUs, with competitive performance results. By demonstrating the potential of automatic code generation, we emphasize the feasibility of using portable solutions to ensure the long-term sustainability of large-scale scientific applications. 

Place, publisher, year, edition, pages
Springer, 2025
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368966 (URN)
Conference
31st European Conference on Parallel and Distributed Processing, Dresden, Germany, August 25–29, 2025
Note

QC 20251204

Available from: 2025-08-23 Created: 2025-08-23 Last updated: 2025-12-04. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-5020-1631
