KTH Publications (kth.se)
Publications (10 of 235)
Hegde, P. R., Marcandelli, P., He, Y., Pennati, L., Williams, J. J., Peng, I. B. & Markidis, S. (2026). A hybrid quantum-classical particle-in-cell method for plasma simulations. Future Generation Computer Systems, 175, Article ID 108087.
A hybrid quantum-classical particle-in-cell method for plasma simulations
2026 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 175, article id 108087. Article in journal (Refereed), Published.
Abstract [en]

We present a hybrid quantum-classical electrostatic Particle-in-Cell (PIC) method, where the electrostatic field Poisson solver is implemented on a quantum computer simulator as a hybrid classical-quantum Neural Network (HNN) trained with data-driven and physics-informed learning approaches. The HNN is trained on classical PIC simulation results and executed via a PennyLane quantum simulator. The remaining computational steps, including particle motion and field interpolation, are performed on a classical system. To evaluate the accuracy and computational cost of this hybrid approach, we test the hybrid quantum-classical electrostatic PIC method against the two-stream instability, a standard benchmark in plasma physics. Our results show that the quantum Poisson solver achieves accuracy comparable to classical methods, and the study provides insights into the feasibility of using quantum computing and HNNs for plasma simulations. We also discuss the computational overhead associated with current quantum computer simulators, showing the challenges and potential advantages of hybrid quantum-classical numerical methods.
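For context, the field solve that the HNN replaces is, classically, a small linear problem solved once per PIC cycle. Below is a minimal sketch of a periodic 1D spectral Poisson solve in Python; the function name, grid, and normalization are illustrative assumptions, not the paper's code.

```python
import numpy as np

def solve_poisson_1d(rho, L):
    """Spectral solve of d^2(phi)/dx^2 = -rho on a periodic 1D grid.

    In Fourier space, -k^2 * phi_hat = -rho_hat, so phi_hat = rho_hat / k^2
    for every nonzero wavenumber (the k = 0 mode is fixed to zero).
    """
    n = len(rho)
    k = 2.0 * np.pi * np.fft.fftfreq(n, d=L / n)  # angular wavenumbers
    rho_hat = np.fft.fft(rho)
    phi_hat = np.zeros_like(rho_hat)
    nonzero = k != 0
    phi_hat[nonzero] = rho_hat[nonzero] / k[nonzero] ** 2
    return np.real(np.fft.ifft(phi_hat))

# Sanity check against an analytic pair: rho = cos(x) gives phi = cos(x).
L = 2.0 * np.pi
x = np.linspace(0.0, L, 128, endpoint=False)
phi = solve_poisson_1d(np.cos(x), L)
print(np.max(np.abs(phi - np.cos(x))))  # error at machine precision
```

In the paper's setting, this step is the one replaced by the HNN surrogate, while the particle push and interpolation stay classical.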

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Hybrid Quantum-Classical Computing, Particle-in-Cell (PIC) Method, Electrostatic Poisson Solver, Quantum Neural Networks (QNNs)
National Category
Fusion, Plasma and Space Physics; Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368973 (URN); 10.1016/j.future.2025.108087 (DOI); 001561183000001 (); 2-s2.0-105013835560 (Scopus ID)
Note

QC 20250825

Available from: 2025-08-25. Created: 2025-08-25. Last updated: 2025-09-17. Bibliographically approved.
Perez Martinez, A., Rezaeiravesh, S., Ju, Y., Laure, E., Markidis, S. & Schlatter, P. (2026). Compression of turbulence time series data using Gaussian process regression. Computer Physics Communications, 319, Article ID 109914.
Compression of turbulence time series data using Gaussian process regression
2026 (English). In: Computer Physics Communications, ISSN 0010-4655, E-ISSN 1879-2944, Vol. 319, article id 109914. Article in journal (Refereed), Published.
Abstract [en]

Turbulence data sets produced from computational fluid dynamics (CFD), especially from finely resolved direct numerical simulations (DNS) and large-eddy simulations (LES) of turbulent flows, tend to be very large due to the high resolutions adopted to accurately resolve the smallest scales. While the computational capacity of high-performance computing (HPC) platforms has kept increasing, storage capacity has lagged to the point that more data is being produced than can be efficiently managed. Among the several methods that have emerged to deal with this problem, an efficient technique is data compression. In this study, we present a proof of concept of a novel data compression approach that relies on Gaussian process regression (GPR) within a Bayesian framework to handle data sets in such a way that initially discarded information can be recovered a posteriori. The approach can be used to supplement existing compression algorithms with measures of uncertainty, and we show that it can be applied to compress not only 3D spatial fields of turbulence but also discrete sets of time series data. The compression algorithm has been designed for data from spectral element method (SEM) simulations but can be extended to spatiotemporal fields obtained from other methods arising in engineering and physics. Our investigation shows that it is possible to use Gaussian process regression for data compression, but it also highlights several limitations: efficient implementations of GPR are crucial for its adoption, and, while it is unlikely that the method can compete in throughput with state-of-the-art methods, given the cost of GPR, there is potential in terms of compression performance, as long as efficient bit-plane coding is integrated.

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Data compression, Gaussian processes, Time series, Turbulence
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-373150 (URN); 10.1016/j.cpc.2025.109914 (DOI); 2-s2.0-105021353415 (Scopus ID)
Note

Not duplicate with DiVA 2005565

QC 20251121

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-21. Bibliographically approved.
Markidis, S., Grandinetti, L. & Taufer, M. (2026). Editorial on future generation computer systems (FGCS) special collection on advances in quantum computing: methods, algorithms, and systems Vol II. Future Generation Computer Systems, 174, Article ID 107993.
Editorial on future generation computer systems (FGCS) special collection on advances in quantum computing: methods, algorithms, and systems Vol II
2026 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 174, article id 107993. Article in journal, Editorial material (Other academic), Published.
Place, publisher, year, edition, pages
Elsevier BV, 2026
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-372818 (URN); 10.1016/j.future.2025.107993 (DOI); 001537992700001 (); 2-s2.0-105010096933 (Scopus ID)
Note

QC 20251119

Available from: 2025-11-19. Created: 2025-11-19. Last updated: 2025-11-19. Bibliographically approved.
Karp, M., Stanly, R., Mukha, T., Galimberti, L., Toosi, S., Song, H., . . . Schlatter, P. (2026). Effects of lower floating-point precision on scale-resolving numerical simulations of turbulence. Journal of Computational Physics, 549, Article ID 114600.
Effects of lower floating-point precision on scale-resolving numerical simulations of turbulence
2026 (English). In: Journal of Computational Physics, ISSN 0021-9991, E-ISSN 1090-2716, Vol. 549, article id 114600. Article in journal (Refereed), Published.
Abstract [en]

Modern computing clusters offer specialized hardware for reduced-precision arithmetic, which can significantly speed up the time to solution. This is possible due to a decrease in data movement, as well as the ability to perform arithmetic operations at a faster rate. However, for high-fidelity simulations of turbulence, such as direct and large-eddy simulation, the impact of reduced precision on the computed solution and the resulting uncertainty across flow solvers and different flow cases have not been explored in detail, which limits the optimal utilization of new high-performance computing systems. In this work, the effect of reduced precision is studied using four diverse computational fluid dynamics (CFD) solvers (two incompressible, Neko and Simson, and two compressible, PadeLibs and SSDC) on four test cases: turbulent channel flow at Reτ=550 and higher, forced transition in a channel, flow over a cylinder at ReD=3900, and compressible flow over a wing section at Rec=50000. We observe that the flow physics are remarkably robust with respect to reductions in floating-point precision, and that other forms of uncertainty, due to, for example, time averaging, often have a much larger impact on the computed result. Our results indicate that different terms in the Navier–Stokes equations can be computed to a lower floating-point accuracy without affecting the results. In particular, standard IEEE single precision can be used effectively for the entirety of the simulation, showing no significant discrepancies from double-precision results across the solvers and cases considered. Potential pitfalls are also discussed.
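One such pitfall can be shown outside any CFD solver: storage precision and accumulation strategy interact, so a naive sequential float32 reduction drifts badly while a pairwise one stays close to the exact result. A minimal NumPy sketch, with numbers that are purely illustrative and unrelated to the solvers above:

```python
import numpy as np

# Sum 10 million copies of 0.1 in single precision.
n = 10_000_000
vals = np.full(n, 0.1, dtype=np.float32)

# Sequential accumulation: once the running sum is large, each 0.1 falls
# below the float32 resolution of the sum and rounds badly at every step.
naive32 = vals.cumsum(dtype=np.float32)[-1]

# NumPy's sum uses pairwise reduction, which keeps partial sums small
# relative to the addends and so loses far less accuracy.
pairwise32 = vals.sum(dtype=np.float32)

exact = 0.1 * n  # 1,000,000.0 in exact arithmetic
print(float(naive32), float(pairwise32), exact)
```

The sequential sum ends up off by several percent, while the pairwise sum matches the exact total closely; this is one reason a single-precision solver's reduction strategy matters even when the flow physics tolerate float32 storage.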

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Computational fluid dynamics, Direct numerical simulation, Floating-point precision, Turbulence
National Category
Fluid Mechanics; Computational Mathematics; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-375324 (URN); 10.1016/j.jcp.2025.114600 (DOI); 001654296600002 (); 2-s2.0-105025717580 (Scopus ID)
Note

Not duplicate with DiVA 2002138

QC 20260112

Available from: 2026-01-12. Created: 2026-01-12. Last updated: 2026-01-12. Bibliographically approved.
Williams, J. J., Costea, S., Araújo De Medeiros, D., Trilaksono, J., Hegde, P. R., Tskhakaya, D., . . . Markidis, S. (2026). Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale. The international journal of high performance computing applications
Integrating High Performance In-Memory Data Streaming and In-Situ Visualization in Hybrid MPI+OpenMP PIC MC Simulations Towards Exascale
2026 (English). In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846. Article in journal (Refereed), Published.
Abstract [en]

Efficient simulation of complex plasma dynamics is crucial for advancing fusion energy research. Particle-in-Cell (PIC) Monte Carlo (MC) simulations provide insights into plasma behavior, including turbulence and confinement, which are essential for optimizing fusion reactor performance. Transitioning to exascale simulations introduces significant challenges, with traditional file input/output (I/O) inefficiencies remaining a key bottleneck. This work advances BIT1, an electrostatic PIC MC code, by improving the particle mover with OpenMP task-based parallelism, integrating openPMD's streaming API, and enabling in-memory data streaming with the ADIOS2 Sustainable Staging Transport (SST) engine to enhance I/O performance, computational efficiency, and system storage utilization. We employ profiling tools such as gprof, perf, IPM and Darshan, which provide insights into computation, communication, and I/O operations. We implement time-dependent data checkpointing with the openPMD API, enabling seamless data movement and in-situ visualization for real-time analysis without interrupting the simulation. We demonstrate improvements in simulation runtime, data accessibility and real-time insights by comparing traditional file I/O with the ADIOS2 BP4 and SST backends. The proposed hybrid BIT1 openPMD SST enhancement introduces a new paradigm for real-time scientific discovery in plasma simulations, enabling faster insights and more efficient use of exascale computing resources.

Place, publisher, year, edition, pages
London, United Kingdom: Sage Publications, 2026
Keywords
Hybrid MPI+OMP Parallel Programming, openPMD, ADIOS2, In-Memory Data Streaming, In-situ Visualization, Distributed Computing, Efficient Data Processing, Large-Scale PIC MC Simulations
National Category
Fusion, Plasma and Space Physics; Computer Sciences; Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-373650 (URN); 10.1177/10943420251409229 (DOI)
Note

QC 20251212

Available from: 2025-12-04. Created: 2025-12-04. Last updated: 2026-01-23.
Netzer, G., Hegde, P. R., Peng, I. B. & Markidis, S. (2026). Tightly-integrated quantum–classical computing using the QHDL hardware description language. Future Generation Computer Systems, 174, Article ID 107977.
Tightly-integrated quantum–classical computing using the QHDL hardware description language
2026 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 174, article id 107977. Article in journal (Refereed), Published.
Abstract [en]

We present the design, development, and application of QHDL, a quantum hardware description language specifically designed for tightly-coupled quantum–classical computing systems. Together with the language design principles, we describe the QHDL compiler, debugger, and co-simulation infrastructure. We showcase the benefits of using a quantum–classical integrated approach in four use cases requiring close quantum–classical device interaction: Bell's pair circuit, dynamic delay, Quantum Fourier Transform (QFT), and teleportation. To interface with QHDL, we propose to use synchronous techniques that are commonplace in digital hardware design. We illustrate examples of modeling both loosely-coupled and tightly-coupled quantum circuits that use so-called measurement-in-the-middle by utilizing these techniques in QHDL. For clock-cycle accurate implementations, we propose implementing such classical modules as programmable hardware blocks using Register-Transfer Level (RTL) or gate-level approaches. These approaches provide the highest coupling performance and can feasibly be implemented in state-of-the-art control systems.

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Classical feedback, Hardware description languages, Quantum circuits, Quantum computing, Quantum software stack, Quantum–classical algorithms
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-368837 (URN); 10.1016/j.future.2025.107977 (DOI); 001524931100003 (); 2-s2.0-105009494290 (Scopus ID)
Note

QC 20250902

Available from: 2025-09-02. Created: 2025-09-02. Last updated: 2025-09-02. Bibliographically approved.
Williams, J. J., Liu, F., Trilaksono, J., Tskhakaya, D., Costea, S., Kos, L., . . . Markidis, S. (2025). Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming. Journal of Computational Science, 88, Article ID 102590.
Accelerating Particle-in-Cell Monte Carlo Simulations with MPI, OpenMP/OpenACC and Asynchronous Multi-GPU Programming
2025 (English). In: Journal of Computational Science, ISSN 1877-7503, E-ISSN 1877-7511, Vol. 88, article id 102590. Article in journal (Refereed), Published.
Abstract [en]

As fusion energy devices advance, plasma simulations play a critical role in fusion reactor design. Particle-in-Cell Monte Carlo simulations are essential for modelling plasma-material interactions and analysing power load distributions on tokamak divertors. Previous work introduced hybrid parallelization in BIT1 using MPI and OpenMP/OpenACC for shared-memory and multicore CPU processing. In this extended work, we integrate MPI with OpenMP and OpenACC, focusing on asynchronous multi-GPU programming with OpenMP Target Tasks using the "nowait" and "depend" clauses, and OpenACC Parallel with the "async(n)" clause. Our results show significant performance improvements: 16 MPI ranks plus OpenMP threads reduced simulation runtime by 53% on a petascale EuroHPC supercomputer, while the OpenACC multicore implementation achieved a 58% reduction compared to the MPI-only version. Scaling to 64 MPI ranks, OpenACC outperformed OpenMP, achieving a 24% improvement in the particle mover function. On the HPE Cray EX supercomputer, OpenMP and OpenACC consistently reduced simulation times, with a 37% reduction at 100 nodes. Results from MareNostrum 5, a pre-exascale EuroHPC supercomputer, highlight OpenACC's effectiveness, with the "async(n)" configuration delivering notable performance gains. However, OpenMP asynchronous configurations outperform OpenACC at larger node counts, particularly for extreme scaling runs. As BIT1 scales asynchronously to 128 GPUs, OpenMP asynchronous multi-GPU configurations outperformed OpenACC in runtime, demonstrating superior scalability, which continues up to 400 GPUs, further improving runtime. Speedup and parallel efficiency (PE) studies reveal OpenMP asynchronous multi-GPU achieving an 8.77x speedup (54.81% PE) and OpenACC achieving an 8.14x speedup (50.87% PE) on MareNostrum 5, surpassing the CPU-only version. At higher node counts, PE declined across all implementations due to communication and synchronization costs. 
However, the asynchronous multi-GPU versions maintained better PE, demonstrating the benefits of asynchronous multi-GPU execution in reducing scalability bottlenecks. While the CPU-only implementation is faster in some cases, OpenMP's asynchronous multi-GPU approach delivers better GPU performance through asynchronous data transfer and task dependencies, ensuring data consistency and avoiding race conditions. Using NVIDIA Nsight tools, we confirmed BIT1's overall efficiency for large-scale plasma simulations, leveraging current and future exascale supercomputing infrastructures. Asynchronous data transfers and dedicated GPU assignments to MPI ranks enhance performance, with OpenMP's asynchronous multi-GPU implementation, which utilizes OpenMP Target Tasks with the "nowait" and "depend" clauses, outperforming the other configurations. This makes OpenMP the preferred application programming interface when performance portability, high throughput, and efficient GPU utilization are critical, and it enables BIT1 to fully exploit modern supercomputing architectures, advancing fusion energy research. MareNostrum 5 brings us closer to achieving exascale performance.

Place, publisher, year, edition, pages
Netherlands: Elsevier BV, 2025
Keywords
Hybrid Programming, OpenMP, Task-Based Parallelism, Dependency Management, OpenACC, Asynchronous Execution, Multi-GPU Offloading, Overlapping Kernels, Large-Scale PIC Simulations
National Category
Computer Systems; Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-362742 (URN); 10.1016/j.jocs.2025.102590 (DOI); 001482576300001 (); 2-s2.0-105003577843 (Scopus ID)
Funder
Swedish Research Council, 2022-06725; KTH Royal Institute of Technology, 101093261
Note

QC 20250619

Available from: 2025-04-24. Created: 2025-04-24. Last updated: 2025-06-19. Bibliographically approved.
Hübner, P., Hu, A., Peng, I. & Markidis, S. (2025). Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency. In: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025: . Paper presented at 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Milan, Italy, June 3-7, 2025 (pp. 45-54). Institute of Electrical and Electronics Engineers (IEEE)
Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency
2025 (English). In: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025. Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 45-54. Conference paper, Published paper (Refereed).
Abstract [en]

This paper investigates the architectural features and performance potential of the Apple Silicon M-Series SoCs (M1, M2, M3, and M4) for HPC. We provide a detailed review of the CPU and GPU designs, the unified memory architecture, and coprocessors such as Advanced Matrix Extensions (AMX). We design and develop benchmarks in the Metal Shading Language and Objective-C++ to assess FP32 computational and memory performance. We also measure power consumption and efficiency using Apple's powermetrics tool. Our results show that the M-Series chips offer up to 100 GB/s memory bandwidth and significant generational improvements in computational performance, with up to 2.9 FP32 TFLOPS on the M4. Power consumption ranges from a few watts to 10-20 watts, with all four chips reaching more than 200 GFLOPS per watt on the GPU and accelerators. Despite limitations in FP64 support on the GPU, the M-Series chips demonstrate strong potential for energy-efficient HPC applications. While existing HPC solutions such as the Nvidia Grace-Hopper superchip outperform Apple Silicon in both memory bandwidth and computational performance, we see that the M-Series provides a competitive power-efficient alternative to traditional HPC architectures and represents a distinct category altogether - forming an apples-to-oranges comparison.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
Apple Silicon M-Series GPU Performance, ARM-based SoC, M1, M2, M3, M4 Architecture
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-370765 (URN); 10.1109/IPDPSW66978.2025.00013 (DOI); 2-s2.0-105015528421 (Scopus ID)
Conference
2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Milan, Italy, June 3-7, 2025
Note

Part of ISBN 9798331526436

QC 20251001

Available from: 2025-10-01. Created: 2025-10-01. Last updated: 2025-10-01. Bibliographically approved.
Bragone, F., Morozovska, K., Rosén, T., Laneryd, T., Söderberg, D. & Markidis, S. (2025). Automatic learning analysis of flow-induced birefringence in cellulose nanofibrils. Journal of Computational Science, 85, Article ID 102536.
Automatic learning analysis of flow-induced birefringence in cellulose nanofibrils
2025 (English). In: Journal of Computational Science, ISSN 1877-7503, E-ISSN 1877-7511, Vol. 85, article id 102536. Article in journal (Refereed), Published.
Abstract [en]

Cellulose Nanofibrils (CNFs), abundant in nature, can be used as building blocks for future sustainable materials, including strong and stiff filaments. A rheo-optical flow-stop technique is used to conduct experiments that characterize the CNFs by studying Brownian dynamics through the decay of the CNFs' birefringence after the flow stop. As the experiments produce large quantities of data, we reduce their dimensionality using Principal Component Analysis (PCA) and exploit the possibility of visualizing the reduced data in two ways. First, we plot the principal components (PCs) as time series, and by training LSTM networks assigned to each PC time series with the data before the flow stop, we predict the behavior after the flow stop (Bragone et al., 2024). Second, we plot the first PCs against each other to create clusters that give information about the different CNF materials and concentrations. Our approach aims at classifying the CNF materials at varying concentrations by applying unsupervised machine learning algorithms, such as k-means and Gaussian Mixture Models (GMMs). Finally, we analyze the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF) of the first principal component, detecting seasonality at lower concentrations.

Place, publisher, year, edition, pages
Elsevier BV, 2025
Keywords
Cellulose nanofibrils, Principal component analysis, Long short-term memory, k-means, Gaussian mixture models
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-360732 (URN); 10.1016/j.jocs.2025.102536 (DOI); 001425378400001 (); 2-s2.0-85217011665 (Scopus ID)
Note

QC 20250303

Available from: 2025-03-03. Created: 2025-03-03. Last updated: 2025-05-02. Bibliographically approved.
Jansson, N., Karp, M., Wahlgren, J., Markidis, S. & Schlatter, P. (2025). Design of Neko—A Scalable High‐Fidelity Simulation Framework With Extensive Accelerator Support. Concurrency and Computation, 37(2), Article ID e8340.
Design of Neko—A Scalable High‐Fidelity Simulation Framework With Extensive Accelerator Support
2025 (English). In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 37, no 2, article id e8340. Article in journal (Refereed), Published.
Abstract [en]

Recent trends and advancements in including more diverse and heterogeneous hardware in High-Performance Computing (HPC) are challenging scientific software developers in their pursuit of efficient numerical methods with sustained performance across a diverse set of platforms. As a result, researchers are today forced to re-factor their codes to leverage these powerful new heterogeneous systems. We present our design considerations for Neko—a portable framework for high-fidelity spectral element flow simulations. Unlike prior works, Neko adopts a modern object-oriented Fortran 2008 approach, allowing multi-tier abstractions of the solver stack and facilitating various hardware backends ranging from general-purpose processors and accelerators down to exotic vector processors and Field-Programmable Gate Arrays (FPGAs). Focusing on the performance and portability of Neko, we describe the framework's device abstraction layer managing device memory, data transfer and kernel launches from Fortran, allowing for a solver written in a hardware-neutral yet performant way. Accelerator-specific optimizations are also discussed, with auto-tuning of key kernels and various communication strategies using device-aware MPI. Finally, we present performance measurements on a wide range of computing platforms, including the EuroHPC pre-exascale system LUMI, where Neko achieves excellent parallel efficiency for a large direct numerical simulation (DNS) of turbulent fluid flow using up to 80% of the entire LUMI supercomputer.

Place, publisher, year, edition, pages
Wiley, 2025
National Category
Computational Mathematics Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358042 (URN); 10.1002/cpe.8340 (DOI); 001387473600001 (); 2-s2.0-85213688601 (Scopus ID)
Funder
Swedish Research Council, 2019-04723; Swedish e-Science Research Center, SESSI; EU, Horizon Europe, 101093393
Note

QC 20250122

Available from: 2025-01-03. Created: 2025-01-03. Last updated: 2025-01-22. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-0639-0639
