kth.se Publications, KTH
Publications (10 of 61)
Hegde, P. R., Marcandelli, P., He, Y., Pennati, L., Williams, J. J., Peng, I. B. & Markidis, S. (2026). A hybrid quantum-classical particle-in-cell method for plasma simulations. Future Generation Computer Systems, 175, Article ID 108087.
2026 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 175, article id 108087. Article in journal (Refereed). Published
Abstract [en]

We present a hybrid quantum-classical electrostatic Particle-in-Cell (PIC) method, in which the electrostatic-field Poisson solver is implemented on a quantum computer simulator as a hybrid classical-quantum Neural Network (HNN) that combines data-driven and physics-informed learning approaches. The HNN is trained on classical PIC simulation results and executed via a PennyLane quantum simulator. The remaining computational steps, including particle motion and field interpolation, are performed on a classical system. To evaluate the accuracy and computational cost of this hybrid approach, we test it on the two-stream instability, a standard benchmark in plasma physics. Our results show that the quantum Poisson solver achieves accuracy comparable to classical methods and provides insights into the feasibility of using quantum computing and HNNs for plasma simulations. We also discuss the computational overhead associated with current quantum computer simulators, highlighting both the challenges and the potential advantages of hybrid quantum-classical numerical methods.
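The classical baseline that the hybrid solver replaces can be sketched as a standard finite-difference Poisson solve. The sketch below is our own illustration, not the paper's code (the function name, zero-potential boundary conditions, and test problem are all assumptions): it solves -phi'' = rho on a uniform 1D grid with the Thomas algorithm for the resulting tridiagonal system.

```python
import math

def solve_poisson_1d(rho, dx):
    """Solve -phi'' = rho with phi = 0 at both ends (Thomas algorithm).

    Discretization: (-phi[i-1] + 2*phi[i] - phi[i+1]) / dx**2 = rho[i].
    """
    n = len(rho)
    m = n - 2                        # number of interior unknowns
    inv = 1.0 / (dx * dx)
    a = [-inv] * m                   # sub-diagonal
    b = [2.0 * inv] * m              # main diagonal
    c = [-inv] * m                   # super-diagonal
    d = [rho[i + 1] for i in range(m)]
    # Forward elimination
    for i in range(1, m):
        w = a[i] / b[i - 1]
        b[i] -= w * c[i - 1]
        d[i] -= w * d[i - 1]
    # Back substitution (phi[0] and phi[-1] stay 0)
    phi = [0.0] * n
    phi[m] = d[m - 1] / b[m - 1]
    for i in range(m - 2, -1, -1):
        phi[i + 1] = (d[i] - c[i] * phi[i + 2]) / b[i]
    return phi

# Manufactured solution phi(x) = sin(pi*x) on [0, 1], so rho = pi^2 sin(pi*x).
n = 101
dx = 1.0 / (n - 1)
rho = [math.pi ** 2 * math.sin(math.pi * i * dx) for i in range(n)]
phi = solve_poisson_1d(rho, dx)
max_err = max(abs(p - math.sin(math.pi * i * dx)) for i, p in enumerate(phi))
```

In a full PIC step, particle charge would first be deposited into `rho`, and the solved `phi` would be differentiated to obtain the electric field; the paper's HNN is trained on the outputs of classical solves of this kind.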

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Hybrid Quantum-Classical Computing, Particle-in-Cell (PIC) Method, Electrostatic Poisson Solver, Quantum Neural Networks (QNNs)
National Category
Fusion, Plasma and Space Physics; Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368973 (URN)
10.1016/j.future.2025.108087 (DOI)
001561183000001 (ISI)
2-s2.0-105013835560 (Scopus ID)
Note

QC 20250825

Available from: 2025-08-25. Created: 2025-08-25. Last updated: 2025-09-17. Bibliographically approved
Shi, R., Schieffer, G., Gokhale, M., Lin, P. H., Patel, H. & Peng, B. (2026). ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace. In: Euro-Par 2025: Parallel Processing - 31st European Conference on Parallel and Distributed Processing, Proceedings. Paper presented at 31st International Conference on Parallel and Distributed Computing, Euro-Par 2025, Dresden, Germany, August 25-29, 2025 (pp. 33-47). Springer Nature
2026 (English). In: Euro-Par 2025: Parallel Processing - 31st European Conference on Parallel and Distributed Processing, Proceedings, Springer Nature, 2026, p. 33-47. Conference paper, Published paper (Refereed)
Abstract [en]

Vector architectures are essential for boosting computing throughput. ARM provides SVE as its next-generation, length-agnostic vector extension beyond traditional fixed-length SIMD. This work provides a first study of the maturity and readiness of exploiting ARM SVE in HPC. Using selected hardware performance events on the ARM Grace processor and analytical models, we derive new metrics to quantify how effectively SVE vectorization reduces executed instructions and improves performance. We further propose an adapted roofline model that combines vector length and data elements to identify potential performance bottlenecks. Finally, we propose a decision tree for classifying SVE-boosted performance in applications.
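Two of the quantities in play here can be written down compactly: the classic roofline bound, and an Amdahl-style estimate of the speedup from vectorizing part of a workload. The Python below is our own sketch with made-up sample numbers; the paper's adapted roofline additionally folds in vector length and data elements, which this sketch does not reproduce.

```python
def roofline_bound(peak_flops, mem_bw_bytes, intensity_flops_per_byte):
    """Classic roofline: attainable FLOP/s is capped by either the compute
    peak or memory bandwidth times arithmetic intensity."""
    return min(peak_flops, mem_bw_bytes * intensity_flops_per_byte)

def vector_speedup(vectorized_fraction, vector_elems):
    """Amdahl-style speedup when a fraction of the work is vectorized,
    processing vector_elems data elements per vector instruction."""
    f, w = vectorized_fraction, vector_elems
    return 1.0 / ((1.0 - f) + f / w)

# Sample numbers (ours, not Grace measurements): a kernel at 0.25 FLOP/byte
# is memory-bound on essentially any modern CPU.
bound = roofline_bound(peak_flops=3.5e12, mem_bw_bytes=500e9,
                       intensity_flops_per_byte=0.25)
```

With 128-bit SVE on FP64, for example, `vector_elems` would be 2; fully vectorized code then tops out at a 2x instruction-count reduction, which is one reason the paper's metrics separate instruction reduction from realized speedup.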

Place, publisher, year, edition, pages
Springer Nature, 2026
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-370460 (URN)
10.1007/978-3-031-99857-7_3 (DOI)
2-s2.0-105014494119 (Scopus ID)
Conference
31st International Conference on Parallel and Distributed Computing, Euro-Par 2025, Dresden, Germany, August 25-29, 2025
Note

Part of ISBN 9783031998560

QC 20250929

Available from: 2025-09-29. Created: 2025-09-29. Last updated: 2025-09-29. Bibliographically approved
Netzer, G., Hegde, P. R., Peng, I. B. & Markidis, S. (2026). Tightly-integrated quantum–classical computing using the QHDL hardware description language. Future Generation Computer Systems, 174, Article ID 107977.
2026 (English). In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 174, article id 107977. Article in journal (Refereed). Published
Abstract [en]

We present the design, development, and application of QHDL, a quantum hardware description language specifically designed for tightly-coupled quantum–classical computing systems. Together with the language design principles, we describe the QHDL compiler, debugger, and co-simulation infrastructure. We showcase the benefits of an integrated quantum–classical approach in four use cases requiring close quantum–classical device interaction: the Bell pair circuit, dynamic delay, the Quantum Fourier Transform (QFT), and teleportation. To interface with QHDL, we propose using synchronous techniques that are commonplace in digital hardware design. We illustrate how these techniques model both loosely-coupled and tightly-coupled quantum circuits in QHDL, including circuits that use so-called measurement-in-the-middle. For clock-cycle-accurate implementations, we propose implementing such classical modules as programmable hardware blocks using Register-Transfer Level (RTL) or gate-level approaches. These approaches provide the highest coupling performance and are feasible to implement in state-of-the-art control systems.
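The smallest of the four use cases, the Bell pair circuit, can be sketched at the statevector level to show what such a circuit computes. The Python below is our own illustration, not QHDL code (QHDL describes the circuit together with its classical control at the hardware-description level; the basis ordering and names here are our assumptions):

```python
import math

s = 1.0 / math.sqrt(2.0)

def bell_pair():
    """Statevector simulation of H(q0) followed by CNOT(q0 -> q1).

    Basis ordering is little-endian |q1 q0>: index 0 = |00>, 1 = |01>,
    2 = |10>, 3 = |11> (q0 is the least-significant bit).
    """
    state = [1.0, 0.0, 0.0, 0.0]          # start in |00>
    # Hadamard on q0 mixes amplitude pairs whose q0 bit differs
    state = [s * (state[0] + state[1]),
             s * (state[0] - state[1]),
             s * (state[2] + state[3]),
             s * (state[2] - state[3])]
    # CNOT with control q0, target q1: flips q1 where q0 = 1,
    # i.e. swaps the |01> and |11> amplitudes
    state[1], state[3] = state[3], state[1]
    return state
```

The result is the entangled state (|00> + |11>)/sqrt(2); in the tightly-coupled setting the paper targets, the interesting part is the cycle-accurate classical feedback around such circuits rather than the statevector itself.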

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Classical feedback, Hardware description languages, Quantum circuits, Quantum computing, Quantum software stack, Quantum–classical algorithms
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-368837 (URN)
10.1016/j.future.2025.107977 (DOI)
001524931100003 (ISI)
2-s2.0-105009494290 (Scopus ID)
Note

QC 20250902

Available from: 2025-09-02. Created: 2025-09-02. Last updated: 2025-09-02. Bibliographically approved
Hübner, P., Hu, A., Peng, I. & Markidis, S. (2025). Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency. In: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025. Paper presented at 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Milan, Italy, June 3-7, 2025 (pp. 45-54). Institute of Electrical and Electronics Engineers (IEEE)
2025 (English). In: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 45-54. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the architectural features and performance potential of the Apple Silicon M-Series SoCs (M1, M2, M3, and M4) for HPC. We provide a detailed review of the CPU and GPU designs, the unified memory architecture, and coprocessors such as Advanced Matrix Extensions (AMX). We design and develop benchmarks in the Metal Shading Language and Objective-C++ to assess FP32 computational and memory performance. We also measure power consumption and efficiency using Apple's powermetrics tool. Our results show that the M-Series chips offer up to 100 GB/s memory bandwidth and significant generational improvements in computational performance, with up to 2.9 FP32 TFLOPS on the M4. Power consumption varies from a few Watts to 10-20 Watts, with GPU-and-accelerator efficiency exceeding 200 GFLOPS per Watt on all four chips. Despite limited FP64 support on the GPU, the M-Series chips demonstrate strong potential for energy-efficient HPC applications. While existing HPC solutions such as the Nvidia Grace-Hopper superchip outperform Apple Silicon in both memory bandwidth and computational performance, the M-Series provides a competitive power-efficient alternative to traditional HPC architectures and represents a distinct category altogether - forming an apples-to-oranges comparison.
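The abstract's peak-throughput and efficiency figures can be cross-checked with one line of arithmetic: sustaining 200 GFLOPS/W at 2.9 FP32 TFLOPS implies a power draw of at most 14.5 W, which falls inside the reported 10-20 W envelope. The per-chip power figure below is our back-calculation, not a number from the paper.

```python
peak_flops = 2.9e12    # M4 FP32 peak, from the abstract
target_eff = 200e9     # GFLOPS-per-Watt threshold, from the abstract

# Efficiency = FLOP/s divided by Watts, so the maximum power that still
# meets the efficiency threshold at peak throughput is:
max_power_w = peak_flops / target_eff
```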

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
Apple Silicon M-Series GPU Performance, ARM-based SoC, M1, M2, M3, M4 Architecture
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-370765 (URN)
10.1109/IPDPSW66978.2025.00013 (DOI)
2-s2.0-105015528421 (Scopus ID)
Conference
2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Milan, Italy, June 3-7, 2025
Note

Part of ISBN 9798331526436

QC 20251001

Available from: 2025-10-01. Created: 2025-10-01. Last updated: 2025-10-01. Bibliographically approved
Araújo De Medeiros, D., Williams, J. J., Wahlgren, J., Saud Maia Leite, L. & Peng, I. B. (2025). ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments. In: 31st International European Conference on Parallel and Distributed Computing. Paper presented at the 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, August 25-29, 2025. Springer Nature
2025 (English). In: 31st International European Conference on Parallel and Distributed Computing, Springer Nature, 2025. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Vertical scaling, HPC workloads, Cloud Computing, Resource Adaptivity, Memory Resource Provisioning
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363170 (URN)
10.1007/978-3-031-99854-6_12 (DOI)
2-s2.0-105015430232 (Scopus ID)
Conference
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, 25-29 Aug, 2025
Note

QC 20250923

Available from: 2025-05-06. Created: 2025-05-06. Last updated: 2025-09-23. Bibliographically approved
Peng, I. B., Wahlgren, J., Youssef, K., Iwabuchi, K., Pearce, R. & Gokhale, M. (2025). UMap: An application-oriented user level memory mapping library. The International Journal of High Performance Computing Applications, 39(2), 269-282
2025 (English). In: The International Journal of High Performance Computing Applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 39, no. 2, p. 269-282. Article in journal (Refereed). Published
Abstract [en]

Exploiting the prominent role of complex memories in exascale node architectures, the UMap page-fault handler offers new capabilities for directly accessing large memory-mapped data sets. UMap provides flexible configuration options to customize page handling to each application, including analysis of massive observational and simulation data sets. The high-performance design features I/O decoupling, dynamic load balancing, and application-level controls. Page faults triggered by application threads and processes accessing data mapped to a UMapp’ed region are handled via the Linux userfaultfd protocol, an asynchronous message-oriented kernel-user communication mechanism that avoids the context-switch penalty of traditional signal-based fault handlers. UMap is fully open source. In this paper, we give an overview of the UMap library architecture, its extensible plugin architecture, and the use and performance of UMap in emerging heterogeneous memory hierarchies such as near-node Non-Volatile Memory (NVM) and network-attached memories. We highlight new capabilities in two page-fault management plugins, NetworkStore and SparseStore. We demonstrate the integration between UMap and multiple ECP products including Caliper, Metall, ZFP, Mochi, and Ripples.
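For readers unfamiliar with memory mapping, the access pattern UMap serves looks like the stdlib sketch below: plain loads and stores through a mapped region instead of explicit read/write calls. This sketch uses the kernel's default mmap paging; UMap's contribution is to intercept the resulting page faults with userfaultfd and resolve them with configurable user-space policies. The file path and sizes here are made up for illustration.

```python
import mmap
import os
import tempfile

# Create a small page-sized backing file, then access it through a mapping.
path = os.path.join(tempfile.mkdtemp(), "backing.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * mmap.PAGESIZE)

with open(path, "r+b") as f:
    mapped = mmap.mmap(f.fileno(), 0)   # length 0 maps the whole file
    mapped[0:5] = b"hello"              # store through the mapping
    first = bytes(mapped[0:5])          # load through the mapping
    mapped.flush()                      # write dirty pages back to the file
    mapped.close()
```

Every access to `mapped` that touches a non-resident page triggers a page fault; with UMap, a user-space handler (e.g. the NetworkStore or SparseStore plugin described above) decides how to populate that page instead of the kernel's default file-backed paging.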

Place, publisher, year, edition, pages
SAGE Publications, 2025
Keywords
Memory mapping, mmap, page fault, page management, user-space paging, userfaultfd
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-362548 (URN)
10.1177/10943420241303145 (DOI)
001370462800001 (ISI)
2-s2.0-105002054781 (Scopus ID)
Note

QC 20250422

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-04-22. Bibliographically approved
Araújo De Medeiros, D., Schieffer, G., Wahlgren, J. & Peng, I. B. (2025). Understanding Layered Portability from HPC to Cloud in Containerized Environments. In: Weiland, M., Neuwirth, S., Kruse, C. & Weinzierl, T. (Eds.), Proceedings High Performance Computing. ISC High Performance 2024 International Workshops. Paper presented at 39th International Conference of the ISC High Performance, May 12-16, 2024, Hamburg, Germany (pp. 439-452). Springer Nature, 15058
2025 (English). In: Proceedings High Performance Computing. ISC High Performance 2024 International Workshops / [ed] Weiland, M., Neuwirth, S., Kruse, C. & Weinzierl, T., Springer Nature, 2025, Vol. 15058, p. 439-452. Conference paper, Published paper (Refereed)
Abstract [en]

Recent developments in lightweight OS-level virtualization (containers) provide a potential solution for running HPC applications on cloud platforms. In this work, we focus on the impact of the different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, host OS and kernel, and rootless versus privileged container execution. Our results indicate less than 4% container overhead in DGEMM, miniMD, and XSBench, but 8%-10% overhead in FFT, HPCG, and Hypre. We also show that changing between the container execution modes results in negligible performance differences in the six applications.
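Overhead figures of this kind are simple ratios. The helper below shows the usual convention (our own formulation, since the abstract does not spell out the formula): overhead is the relative slowdown of the containerized run against the bare-metal baseline.

```python
def container_overhead(t_container, t_native):
    """Relative overhead of a containerized run versus a native baseline.

    Returns a fraction: 0.04 means the containerized run is 4% slower.
    """
    return (t_container - t_native) / t_native

# Hypothetical timings (not from the paper): a 10.4 s containerized run
# against a 10.0 s native run corresponds to 4% overhead.
overhead = container_overhead(10.4, 10.0)
```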

Place, publisher, year, edition, pages
Springer Nature, 2025
Series
Lecture Notes in Computer Science, ISSN 0302-9743
Keywords
Cloud and HPC Convergence, Containers, ARM, Performance
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-366126 (URN)
10.1007/978-3-031-73716-9_31 (DOI)
001463189500031 (ISI)
2-s2.0-105009319340 (Scopus ID)
Conference
39th International Conference of the ISC High Performance, May 12-16, 2024, Hamburg, Germany
Note

Part of ISBN 978-3-031-73715-2, 978-3-031-73716-9

QC 20250703

Available from: 2025-07-03. Created: 2025-07-03. Last updated: 2025-07-10. Bibliographically approved
Hegde, P. R., Kyriienko, O., Heimonen, H., Tolias, P., Netzer, G., Barkoutsos, P., . . . Markidis, S. (2024). Beyond the Buzz: Strategic Paths for Enabling Useful NISQ Applications. In: Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024. Paper presented at 21st ACM International Conference on Computing Frontiers, CF 2024, Ischia, Italy, May 7-9, 2024 (pp. 310-313). Association for Computing Machinery (ACM)
2024 (English). In: Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024, Association for Computing Machinery (ACM), 2024, p. 310-313. Conference paper, Published paper (Refereed)
Abstract [en]

There is much debate on whether quantum computing on current NISQ devices, consisting of hundreds of noisy qubits and requiring non-negligible classical computing as part of the algorithms, has utility and will ever offer advantages over traditional computing for scientific and industrial applications. In this position paper, we argue that while real-world NISQ quantum applications have yet to surpass their classical counterparts, strategic approaches can facilitate advancements in both industrial and scientific applications. We have identified three key strategies to guide NISQ computing towards practical and useful implementations. Firstly, prioritizing the identification of a "killer app" is a key point. An application demonstrating the distinctive capabilities of NISQ devices can catalyze broader development. We suggest focusing on applications that are inherently quantum, pointing towards quantum chemistry and materials science as promising domains. These fields hold the potential to exhibit benefits, setting benchmarks for other applications to follow. Secondly, integrating AI and deep-learning methods into NISQ computing is a promising approach. Examples such as quantum Physics-Informed Neural Networks and Differentiable Quantum Circuits (DQC) demonstrate the synergy between quantum computing and AI. Lastly, recognizing the interdisciplinary nature of NISQ computing, we advocate for a co-design approach. Achieving synergy between classical and quantum computing necessitates an effort in co-designing quantum applications, algorithms, and programming environments, and in integrating HPC with quantum hardware. The interoperability of these components is crucial for enabling the full potential of NISQ computing. In conclusion, through these three approaches, we argue that NISQ computing can surpass current limitations and evolve into a valuable tool for scientific and industrial applications. This requires an approach that integrates domain-specific killer apps, harnesses the power of quantum-enhanced AI, and embraces a collaborative co-design methodology.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
AI & Quantum, Codesign, NISQ Computing, Quantum Applications
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:kth:diva-350989 (URN)
10.1145/3649153.3649182 (DOI)
001267265700037 (ISI)
2-s2.0-85198901369 (Scopus ID)
Conference
21st ACM International Conference on Computing Frontiers, CF 2024, Ischia, Italy, May 7-9, 2024
Note

Part of ISBN 9798400705977

QC 20240725

Available from: 2024-07-24. Created: 2024-07-24. Last updated: 2024-09-10. Bibliographically approved
Schieffer, G., Pornthisan, N., Araújo De Medeiros, D., Markidis, S., Wahlgren, J. & Peng, I. B. (2024). Boosting the Performance of Object Tracking with a Half-Precision Particle Filter on GPU. In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers. Paper presented at international workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, August 28 - September 1, 2023, Limassol, Cyprus (pp. 294-305). Springer Nature
2024 (English). In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers, Springer Nature, 2024, p. 294-305. Conference paper, Published paper (Refereed)
Abstract [en]

High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving and robot localization to time-series prediction. In this work, we investigate the design, development and optimization of particle filters using half precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40 and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic changes in the particle filters. Using half precision leads to a performance improvement of 1.5–2× and 2.5–4.6× over the single- and double-precision baselines respectively, at the cost of a relatively small loss of accuracy.
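The algorithm being accelerated is compact; a minimal bootstrap particle filter for a 1D state (predict, Gaussian re-weighting, resample) can be sketched in pure Python. This is our own illustration of the method's structure, not the paper's CUDA half-precision implementation; the motion model, noise parameters, and names are all assumptions.

```python
import math
import random

def pf_step(particles, observation, motion_std, obs_std, rng):
    """One bootstrap particle-filter step: predict, weight, resample."""
    # Predict: propagate each particle through a random-walk motion model.
    particles = [p + rng.gauss(0.0, motion_std) for p in particles]
    # Weight: Gaussian likelihood of the observation given each particle.
    weights = [math.exp(-0.5 * ((observation - p) / obs_std) ** 2)
               for p in particles]
    # Resample proportionally to weight; weights implicitly reset to uniform.
    return rng.choices(particles, weights=weights, k=len(particles))

# Track a stationary target at 5.0 from particles spread over [0, 10].
rng = random.Random(0)
particles = [rng.uniform(0.0, 10.0) for _ in range(500)]
for _ in range(30):
    particles = pf_step(particles, observation=5.0,
                        motion_std=0.1, obs_std=0.5, rng=rng)
estimate = sum(particles) / len(particles)
```

The weighting step is where half precision bites: the exponential likelihoods span many orders of magnitude and readily underflow in FP16, which is the kind of numerical instability the paper's algorithmic changes address.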

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
GPUs, Half-Precision, Particle Filter, Reduced Precision
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-346540 (URN)
10.1007/978-3-031-50684-0_23 (DOI)
001279250600023 (ISI)
2-s2.0-85192268315 (Scopus ID)
Conference
International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, August 28 - September 1, 2023, Limassol, Cyprus
Note

Part of proceedings ISBN 978-3-031-50683-3

QC 20240520

Available from: 2024-05-16. Created: 2024-05-16. Last updated: 2024-09-10. Bibliographically approved
Wahlgren, J., Schieffer, G., Gokhale, M., Pearce, R. & Peng, I. (2024). Disaggregated Memory with SmartNIC Offloading: a Case Study on Graph Processing. Paper presented at IEEE/SBC 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, Hawaii, November 13-15, 2024. Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Disaggregated memory breaks the boundary of monolithic servers to enable memory provisioning on demand. Using network-attached memory to provide memory expansion for memory-intensive applications on compute nodes can improve overall memory utilization on a cluster and reduce the total cost of ownership. However, current software solutions for leveraging network-attached memory must consume resources on the compute node for memory-management tasks. Emerging off-path SmartNICs provide general-purpose programmability on low-cost, low-power cores. This work provides a general architecture design that enables network-attached memory and offloads management tasks onto an off-path programmable SmartNIC. We provide a prototype implementation called SODA on the Nvidia BlueField DPU. SODA adapts communication paths and data-transfer alternatives, pipelines data-movement stages, and enables customizable data caching and prefetching optimizations. We evaluate SODA on five representative graph applications over real-world graphs. Our results show that SODA can achieve up to 7.9x speedup compared to node-local SSD and reduce network traffic by 42% compared to disaggregated memory without SmartNIC offloading, at similar or better performance.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
SmartNIC, disaggregated memory, fabric-attached memory
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-326630 (URN)
10.1109/SBAC-PAD63648.2024.00022 (DOI)
2-s2.0-85212448438 (Scopus ID)
Conference
IEEE/SBC 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, Hawaii, November 13-15, 2024
Note

QC 20250113

Available from: 2023-05-08. Created: 2024-10-03. Last updated: 2025-01-13. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-4158-3583
