Publications (10 of 61)
Hegde, P. R., Marcandelli, P., He, Y., Pennati, L., Williams, J. J., Peng, I. B. & Markidis, S. (2026). A hybrid quantum-classical particle-in-cell method for plasma simulations. Future Generation Computer Systems, 175, Article ID 108087.
2026 (English) In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 175, article id 108087. Article in journal (Refereed) Published
Abstract [en]

We present a hybrid quantum-classical electrostatic Particle-in-Cell (PIC) method, where the electrostatic field Poisson solver is implemented on a quantum computer simulator as a hybrid classical-quantum Neural Network (HNN) that combines data-driven and physics-informed learning approaches. The HNN is trained on classical PIC simulation results and executed via a PennyLane quantum simulator. The remaining computational steps, including particle motion and field interpolation, are performed on a classical system. To evaluate the accuracy and computational cost of this hybrid approach, we test the hybrid quantum-classical electrostatic PIC against the two-stream instability, a standard benchmark in plasma physics. Our results show that the quantum Poisson solver achieves accuracy comparable to classical methods and provide insights into the feasibility of using quantum computing and HNNs for plasma simulations. We also discuss the computational overhead associated with current quantum computer simulators, showing the challenges and potential advantages of hybrid quantum-classical numerical methods.
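For context, the classical step the paper replaces with a quantum HNN is a standard electrostatic Poisson solve. A minimal NumPy sketch of that step on a periodic 1D grid is shown below; the grid size and charge profile are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def solve_poisson_periodic(rho, L=2 * np.pi):
    """Solve phi'' = -rho on a periodic 1D grid via FFT (units with eps0 = 1)."""
    n = len(rho)
    k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)       # wave numbers in rad/length
    rho_hat = np.fft.fft(rho)
    phi_hat = np.zeros_like(rho_hat)
    nonzero = k != 0                                 # k = 0 mode fixed by neutrality
    phi_hat[nonzero] = rho_hat[nonzero] / k[nonzero] ** 2   # -k^2 phi_hat = -rho_hat
    return np.fft.ifft(phi_hat).real

# For rho(x) = sin(x), the periodic solution is phi(x) = sin(x).
x = np.linspace(0, 2 * np.pi, 128, endpoint=False)
phi = solve_poisson_periodic(np.sin(x))
```

In the paper's hybrid scheme, a trained HNN evaluated on a quantum simulator would stand in for this spectral solve, while the particle push and interpolation stay classical.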

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Hybrid Quantum-Classical Computing, Particle-in-Cell (PIC) Method, Electrostatic Poisson Solver, Quantum Neural Networks (QNNs)
National subject category
Fusion, Plasma and Space Physics; Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368973 (URN) 10.1016/j.future.2025.108087 (DOI) 001561183000001 () 2-s2.0-105013835560 (Scopus ID)
Note

QC 20250825

Available from: 2025-08-25 Created: 2025-08-25 Last updated: 2025-09-17 Bibliographically reviewed
Shi, R., Schieffer, G., Gokhale, M., Lin, P. H., Patel, H. & Peng, B. (2026). ARM SVE Unleashed: Performance and Insights Across HPC Applications on Nvidia Grace. In: Euro-Par 2025: Parallel Processing - 31st European Conference on Parallel and Distributed Processing, Proceedings: . Paper presented at 31st International Conference on Parallel and Distributed Computing, Euro-Par 2025, Dresden, Germany, August 25-29, 2025 (pp. 33-47). Springer Nature
2026 (English) In: Euro-Par 2025: Parallel Processing - 31st European Conference on Parallel and Distributed Processing, Proceedings, Springer Nature, 2026, pp. 33-47. Conference paper, Published paper (Refereed)
Abstract [en]

Vector architectures are essential for boosting computing throughput. ARM provides SVE as the next-generation length-agnostic vector extension beyond traditional fixed-length SIMD. This work provides a first study of the maturity and readiness of exploiting ARM SVE in HPC. Using selected hardware performance events on the ARM Grace processor and analytical models, we derive new metrics to quantify how effectively SVE vectorization reduces executed instructions and improves performance. We further propose an adapted roofline model that combines vector length and data elements to identify potential performance bottlenecks. Finally, we propose a decision tree for classifying the SVE-boosted performance in applications.
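The abstract does not spell out the metric definitions, so the sketch below is only an illustrative instruction-reduction calculation of the kind such a study might derive from scalar vs. vector instruction counts; the function name, formula, and event mapping are assumptions, not the paper's metrics.

```python
def sve_instruction_reduction(scalar_instructions, vector_instructions, elements_per_vector):
    """Illustrative fraction of scalar-equivalent work eliminated by vectorization.

    A vector instruction processing `elements_per_vector` data elements replaces
    that many scalar instructions, so the achievable reduction is bounded by
    1 - 1/elements_per_vector (reached when no residual scalar loop remains).
    """
    replaced = vector_instructions * elements_per_vector   # scalar ops absorbed by SVE
    total_scalar_equivalent = scalar_instructions + replaced
    return replaced * (1 - 1 / elements_per_vector) / total_scalar_equivalent

# Example: 2.5e5 SVE instructions on a 128-bit vector holding four FP32
# elements, plus 2e5 residual scalar instructions.
reduction = sve_instruction_reduction(2e5, 2.5e5, 4)
```

With these made-up counts the loop sheds 62.5% of its scalar-equivalent instructions, well short of the 75% ceiling a 4-element vector allows, which is the kind of gap such a metric is meant to expose.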

Place, publisher, year, edition, pages
Springer Nature, 2026
National subject category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-370460 (URN) 10.1007/978-3-031-99857-7_3 (DOI) 2-s2.0-105014494119 (Scopus ID)
Conference
31st International Conference on Parallel and Distributed Computing, Euro-Par 2025, Dresden, Germany, August 25-29, 2025
Note

Part of ISBN 9783031998560

QC 20250929

Available from: 2025-09-29 Created: 2025-09-29 Last updated: 2025-09-29 Bibliographically reviewed
Netzer, G., Hegde, P. R., Peng, I. B. & Markidis, S. (2026). Tightly-integrated quantum–classical computing using the QHDL hardware description language. Future Generation Computer Systems, 174, Article ID 107977.
2026 (English) In: Future Generation Computer Systems, ISSN 0167-739X, E-ISSN 1872-7115, Vol. 174, article id 107977. Article in journal (Refereed) Published
Abstract [en]

We present the design, development, and application of QHDL, a quantum hardware description language specifically designed for tightly-coupled quantum–classical computing systems. Together with the language design principles, we describe the QHDL compiler, debugger, and co-simulation infrastructure. We showcase the benefits of a quantum–classical integrated approach in four use cases requiring close quantum–classical device interaction: the Bell pair circuit, dynamic delay, the Quantum Fourier Transform (QFT), and teleportation. To interface with QHDL, we propose using synchronous techniques that are commonplace in digital hardware design. We illustrate examples of modeling both loosely-coupled and tightly-coupled quantum circuits that use so-called measurement-in-the-middle by applying these techniques in QHDL. For clock-cycle accurate implementations, we propose implementing such classical modules as programmable hardware blocks using Register-Transfer Level (RTL) or gate-level approaches. These approaches provide the highest coupling performance and are feasible to implement in state-of-the-art control systems.
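QHDL itself is not reproduced in this listing; as a reference point for the first use case the abstract names, here is a plain NumPy statevector sketch of the Bell pair circuit (Hadamard followed by CNOT), with qubit 0 taken as the most significant bit by assumption.

```python
import numpy as np

# Single- and two-qubit gates acting on a 2-qubit statevector,
# with qubit 0 as the most significant bit of the basis index.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

state = np.zeros(4)
state[0] = 1.0                      # start in |00>
state = np.kron(H, I2) @ state      # Hadamard on qubit 0: (|00> + |10>)/sqrt(2)
state = CNOT @ state                # entangle: (|00> + |11>)/sqrt(2)
probs = state ** 2                  # measurement probabilities per basis state
```

A hardware description language like QHDL would express the same circuit structurally, alongside the clocked classical logic that reacts to mid-circuit measurements; this sketch only verifies the expected 50/50 correlated outcome.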

Place, publisher, year, edition, pages
Elsevier BV, 2026
Keywords
Classical feedback, Hardware description languages, Quantum circuits, Quantum computing, Quantum software stack, Quantum–classical algorithms
National subject category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-368837 (URN) 10.1016/j.future.2025.107977 (DOI) 001524931100003 () 2-s2.0-105009494290 (Scopus ID)
Note

QC 20250902

Available from: 2025-09-02 Created: 2025-09-02 Last updated: 2025-09-02 Bibliographically reviewed
Hübner, P., Hu, A., Peng, I. & Markidis, S. (2025). Apple vs. Oranges: Evaluating the Apple Silicon M-Series SoCs for HPC Performance and Efficiency. In: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025: . Paper presented at 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Milan, Italy, June 3-7, 2025 (pp. 45-54). Institute of Electrical and Electronics Engineers (IEEE)
2025 (English) In: Proceedings - 2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Institute of Electrical and Electronics Engineers (IEEE), 2025, pp. 45-54. Conference paper, Published paper (Refereed)
Abstract [en]

This paper investigates the architectural features and performance potential of the Apple Silicon M-Series SoCs (M1, M2, M3, and M4) for HPC. We provide a detailed review of the CPU and GPU designs, the unified memory architecture, and coprocessors such as Advanced Matrix Extensions (AMX). We design and develop benchmarks in the Metal Shading Language and Objective-C++ to assess FP32 computational and memory performance. We also measure power consumption and efficiency using Apple's powermetrics tool. Our results show that the M-Series chips offer up to 100 GB/s of memory bandwidth and significant generational improvements in computational performance, with up to 2.9 FP32 TFLOPS on the M4. Power consumption varies from a few watts to 10-20 watts, with all four chips reaching more than 200 GFLOPS per watt on the GPU and accelerators. Despite limitations in FP64 support on the GPU, the M-Series chips demonstrate strong potential for energy-efficient HPC applications. While existing HPC solutions such as the Nvidia Grace-Hopper superchip outperform Apple Silicon in both memory bandwidth and computational performance, we see that the M-Series provides a competitive power-efficient alternative to traditional HPC architectures and represents a distinct category altogether - forming an apples-to-oranges comparison.
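The efficiency claim can be sanity-checked from the numbers the abstract quotes. The sketch below recomputes GFLOPS per watt; the 12 W operating point paired with the M4's peak is an assumed value inside the 10-20 W range the abstract reports, not a measurement from the paper.

```python
def gflops_per_watt(tflops, watts):
    """Energy efficiency in GFLOPS/W (1 TFLOPS = 1000 GFLOPS)."""
    return tflops * 1000 / watts

# The abstract quotes up to 2.9 FP32 TFLOPS on the M4 and GPU power in the
# 10-20 W range; an assumed 12 W draw at peak already clears the
# 200 GFLOPS/W threshold reported for all four chips.
efficiency = gflops_per_watt(2.9, 12.0)
```

Even at the top of the quoted power range (20 W), 2.9 TFLOPS still works out to 145 GFLOPS/W, so the 200 GFLOPS/W figure implies peak throughput is reached well below maximum package power.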

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
Apple Silicon M-Series GPU Performance, ARM-based SoC, M1, M2, M3, M4 Architecture
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-370765 (URN) 10.1109/IPDPSW66978.2025.00013 (DOI) 2-s2.0-105015528421 (Scopus ID)
Conference
2025 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2025, Milan, Italy, June 3-7, 2025
Note

Part of ISBN 9798331526436

QC 20251001

Available from: 2025-10-01 Created: 2025-10-01 Last updated: 2025-10-01 Bibliographically reviewed
Araújo De Medeiros, D., Williams, J. J., Wahlgren, J., Saud Maia Leite, L. & Peng, I. B. (2025). ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments. In: 31st International European Conference on Parallel and Distributed Computing: . Paper presented at The 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, 25-29 Aug, 2025. Springer Nature
2025 (English) In: 31st International European Conference on Parallel and Distributed Computing, Springer Nature, 2025. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Vertical scaling, HPC workloads, Cloud Computing, Resource Adaptivity, Memory Resource Provisioning
National subject category
Electrical and Electronic Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363170 (URN) 10.1007/978-3-031-99854-6_12 (DOI) 2-s2.0-105015430232 (Scopus ID)
Conference
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, 25-29 Aug, 2025
Note

QC 20250923

Available from: 2025-05-06 Created: 2025-05-06 Last updated: 2025-09-23 Bibliographically reviewed
Peng, I. B., Wahlgren, J., Youssef, K., Iwabuchi, K., Pearce, R. & Gokhale, M. (2025). UMap: An application-oriented user level memory mapping library. The international journal of high performance computing applications, 39(2), 269-282
2025 (English) In: The International Journal of High Performance Computing Applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 39, no. 2, pp. 269-282. Article in journal (Refereed) Published
Abstract [en]

Exploiting the prominent role of complex memories in exascale node architecture, the UMap page fault handler offers new capabilities to access large memory-mapped data sets directly. UMap provides flexible configuration options to customize page handling to each application, including analysis of massive observational and simulation data sets. The high-performance design features I/O decoupling, dynamic load balancing, and application-level controls. Page faults triggered by application threads and processes accessing data mapped to a UMap'ed region are handled via the Linux userfaultfd protocol, an asynchronous message-oriented kernel-user communication mechanism that avoids the context switch penalty of traditional signal fault handlers. UMap is fully open source. In this paper, we give an overview of the UMap library architecture, its extensible plugin architecture, and the use and performance of UMap in emerging heterogeneous memory hierarchies such as near-node Non-volatile Memory (NVM) and network-attached memories. We highlight new capabilities in two page fault management plugins, the NetworkStore and SparseStore. We demonstrate the integration between UMap and multiple ECP products including Caliper, Metall, ZFP, Mochi, and Ripples.
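UMap's own API is not shown in this listing; as background, the sketch below uses Python's stdlib `mmap` to show the kernel-serviced memory mapping that UMap generalizes. Where the kernel transparently services these faults, UMap registers the region with userfaultfd and handles the faults itself in user space with application-specific policies. The temporary file here is illustrative.

```python
import mmap
import os
import tempfile

# Create a small page-sized backing file, then map it so that ordinary
# loads and stores go through page faults serviced by the kernel --
# the mechanism UMap reimplements in user space via userfaultfd.
fd, path = tempfile.mkstemp()
os.write(fd, b"\x00" * mmap.PAGESIZE)
with mmap.mmap(fd, mmap.PAGESIZE) as mapped:
    mapped[0:5] = b"hello"          # store into the mapping
    first = bytes(mapped[0:5])      # load served from the page cache
os.close(fd)
os.unlink(path)
```

The point of user-level fault handling is that the equivalent of this kernel path becomes pluggable: a UMap-style handler can fill the faulting page from an SSD, a network-attached store, or a sparse representation, per application.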

Place, publisher, year, edition, pages
SAGE Publications, 2025
Keywords
Memory mapping, mmap, page fault, page management, user-space paging, userfaultfd
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-362548 (URN) 10.1177/10943420241303145 (DOI) 001370462800001 () 2-s2.0-105002054781 (Scopus ID)
Note

QC 20250422

Available from: 2025-04-16 Created: 2025-04-16 Last updated: 2025-04-22 Bibliographically reviewed
Araújo De Medeiros, D., Schieffer, G., Wahlgren, J. & Peng, I. B. (2025). Understanding Layered Portability from HPC to Cloud in Containerized Environments. In: Weiland, M Neuwirth, S Kruse, C Weinzierl, T (Ed.), Proceedings High Performance Computing. ISC High Performance 2024 International Workshops: . Paper presented at 39th International Conference of the ISC High Performance, MAY 12-16, 2024, Hamburg, GERMANY (pp. 439-452). Springer Nature, 15058
2025 (English) In: Proceedings High Performance Computing. ISC High Performance 2024 International Workshops / [ed] Weiland, M., Neuwirth, S., Kruse, C., Weinzierl, T., Springer Nature, 2025, Vol. 15058, pp. 439-452. Conference paper, Published paper (Refereed)
Abstract [en]

Recent developments in lightweight OS-level virtualization (containers) provide a potential solution for running HPC applications on cloud platforms. In this work, we focus on the impact of different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, host OS and kernel, and rootless and privileged container execution. Our results indicate less than 4% container overhead in DGEMM, miniMD, and XSBench, but 8%-10% overhead in FFT, HPCG, and Hypre. We also show that changing between the container execution modes results in negligible performance differences in the six applications.

Place, publisher, year, edition, pages
Springer Nature, 2025
Series
Lecture Notes in Computer Science, ISSN 0302-9743
Keywords
Cloud and HPC Convergence, Containers, ARM, Performance
National subject category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-366126 (URN) 10.1007/978-3-031-73716-9_31 (DOI) 001463189500031 () 2-s2.0-105009319340 (Scopus ID)
Conference
39th International Conference of the ISC High Performance, May 12-16, 2024, Hamburg, Germany
Note

Part of ISBN 978-3-031-73715-2, 978-3-031-73716-9

QC 20250703

Available from: 2025-07-03 Created: 2025-07-03 Last updated: 2025-07-10 Bibliographically reviewed
Hegde, P. R., Kyriienko, O., Heimonen, H., Tolias, P., Netzer, G., Barkoutsos, P., . . . Markidis, S. (2024). Beyond the Buzz: Strategic Paths for Enabling Useful NISQ Applications. In: Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024: . Paper presented at 21st ACM International Conference on Computing Frontiers, CF 2024, Ischia, Italy, May 7 2024 - May 9 2024 (pp. 310-313). Association for Computing Machinery (ACM)
2024 (English) In: Proceedings of the 21st ACM International Conference on Computing Frontiers, CF 2024, Association for Computing Machinery (ACM), 2024, pp. 310-313. Conference paper, Published paper (Refereed)
Abstract [en]

There is much debate on whether quantum computing on current NISQ devices, consisting of hundreds of noisy qubits and requiring non-negligible use of classical computing as part of the algorithms, has utility and will ever offer advantages for scientific and industrial applications with respect to traditional computing. In this position paper, we argue that while real-world NISQ quantum applications have yet to surpass their classical counterparts, strategic approaches can be used to facilitate advancements in both industrial and scientific applications. We have identified three key strategies to guide NISQ computing towards practical and useful implementations. Firstly, prioritizing the identification of a "killer app" is a key point. An application demonstrating the distinctive capabilities of NISQ devices can catalyze broader development. We suggest focusing on applications that are inherently quantum, e.g., pointing towards quantum chemistry and material science as promising domains. These fields hold the potential to exhibit benefits, setting benchmarks for other applications to follow. Secondly, integrating AI and deep-learning methods into NISQ computing is a promising approach. Examples such as quantum Physics-Informed Neural Networks and Differentiable Quantum Circuits (DQC) demonstrate the synergy between quantum computing and AI. Lastly, recognizing the interdisciplinary nature of NISQ computing, we advocate for a co-design approach. Achieving synergy between classical and quantum computing necessitates an effort in co-designing quantum applications, algorithms, and programming environments, and the integration of HPC with quantum hardware. The interoperability of these components is crucial for enabling the full potential of NISQ computing. In conclusion, through these three approaches, we argue that NISQ computing can surpass current limitations and evolve into a valuable tool for scientific and industrial applications. This requires an approach that integrates domain-specific killer apps, harnesses the power of quantum-enhanced AI, and embraces a collaborative co-design methodology.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
AI & Quantum, Codesign, NISQ Computing, Quantum Applications
National subject category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:kth:diva-350989 (URN) 10.1145/3649153.3649182 (DOI) 001267265700037 () 2-s2.0-85198901369 (Scopus ID)
Conference
21st ACM International Conference on Computing Frontiers, CF 2024, Ischia, Italy, May 7-9, 2024
Note

Part of ISBN 9798400705977

QC 20240725

Available from: 2024-07-24 Created: 2024-07-24 Last updated: 2024-09-10 Bibliographically reviewed
Schieffer, G., Pornthisan, N., Araújo De Medeiros, D., Markidis, S., Wahlgren, J. & Peng, I. B. (2024). Boosting the Performance of Object Tracking with a Half-Precision Particle Filter on GPU. In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 – September 1, 2023, Revised Selected Papers: . Paper presented at International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Aug 28 2023 - Sep 1 2023 Limassol, Cyprus (pp. 294-305). Springer Nature
2024 (English) In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 - September 1, 2023, Revised Selected Papers, Springer Nature, 2024, pp. 294-305. Conference paper, Published paper (Refereed)
Abstract [en]

High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving and robot localization to time-series prediction. In this work, we investigate the design, development, and optimization of a particle filter using half precision on CUDA cores and compare its performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40, and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic changes in the particle filters. Using half precision leads to a performance improvement of 1.5-2x and 2.5-4.6x with respect to the single- and double-precision baselines respectively, at the cost of a relatively small loss of accuracy.
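The abstract does not detail the algorithmic changes, so the NumPy sketch below shows one standard mitigation of the instability it mentions: shifting log-weights by their maximum before exponentiating, so half-precision particle weights do not all underflow to zero. The function name and the mixed fp16/fp32 split are illustrative assumptions, not the paper's CUDA implementation.

```python
import numpy as np

def normalize_weights_fp16(log_weights):
    """Normalize particle weights stored in float16 without underflow.

    Naively exponentiating large-magnitude negative log-weights in fp16
    flushes them all to zero (fp16 cannot represent exp(-40)); shifting
    by the max first pins the largest weight at 1.0 and preserves the
    weight ratios that resampling needs.
    """
    lw = np.asarray(log_weights, dtype=np.float32)
    w = np.exp(lw - lw.max()).astype(np.float16)   # max-shifted, fp16 storage
    return w / w.sum(dtype=np.float32)             # accumulate the sum in fp32

weights = normalize_weights_fp16([-40.0, -41.0, -42.0])
```

Keeping per-particle storage in fp16 while accumulating the normalization sum in fp32 is a common mixed-precision pattern: the bandwidth-heavy arrays shrink, while the reduction that is most sensitive to rounding stays in higher precision.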

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
GPUs, Half-Precision, Particle Filter, Reduced Precision
National subject category
Electrical and Electronic Engineering
Identifiers
urn:nbn:se:kth:diva-346540 (URN) 10.1007/978-3-031-50684-0_23 (DOI) 001279250600023 () 2-s2.0-85192268315 (Scopus ID)
Conference
International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, August 28 - September 1, 2023, Limassol, Cyprus
Note

Part of proceedings ISBN: 978-303150683-3

QC 20240520

Available from: 2024-05-16 Created: 2024-05-16 Last updated: 2024-09-10 Bibliographically reviewed
Wahlgren, J., Schieffer, G., Gokhale, M., Pearce, R. & Peng, I. (2024). Disaggregated Memory with SmartNIC Offloading: a Case Study on Graph Processing. In: : . Paper presented at IEEE/SBC 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, Hawaii, November 13-15, 2024. Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Disaggregated memory breaks the boundary of monolithic servers to enable memory provisioning on demand. Using network-attached memory to provide memory expansion for memory-intensive applications on compute nodes can improve the overall memory utilization of a cluster and reduce the total cost of ownership. However, current software solutions for leveraging network-attached memory must consume resources on the compute node for memory management tasks. Emerging off-path SmartNICs provide general-purpose programmability on low-cost, low-power cores. This work provides a general architecture design that enables network-attached memory and offloads tasks onto an off-path programmable SmartNIC. We provide a prototype implementation called SODA on the Nvidia BlueField DPU. SODA adapts communication paths and data transfer alternatives, pipelines data movement stages, and enables customizable data caching and prefetching optimizations. We evaluate SODA on five representative graph applications with real-world graphs. Our results show that SODA can achieve up to 7.9x speedup compared to node-local SSD and reduce network traffic by 42% compared to disaggregated memory without SmartNIC offloading, at similar or better performance.
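SODA's caching and prefetching policies are not specified in the abstract; the sketch below is only a generic LRU page cache of the kind such a customizable caching layer might run on the SmartNIC. The class and method names, the capacity, and the fetch callback are all assumptions for illustration.

```python
from collections import OrderedDict

class PageCache:
    """Tiny LRU cache for remotely stored pages (illustrative, names assumed).

    Hits are served locally; misses invoke a fetch callback standing in for
    a network transfer from disaggregated memory.
    """

    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote   # callback: page id -> bytes
        self.pages = OrderedDict()         # insertion order tracks recency
        self.remote_fetches = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # cache hit: refresh LRU order
            return self.pages[page_id]
        data = self.fetch_remote(page_id)        # miss: go over the network
        self.remote_fetches += 1
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)       # evict least recently used
        return data

cache = PageCache(capacity=2, fetch_remote=lambda p: b"page-%d" % p)
for p in (1, 2, 1, 3, 1):                        # 5 reads, 2 of them hits
    cache.read(p)
```

Every hit in a cache like this is a network transfer avoided, which is the mechanism behind the kind of traffic reduction the abstract reports for caching on the SmartNIC.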

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
SmartNIC, disaggregated memory, fabric-attached memory
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-326630 (URN) 10.1109/SBAC-PAD63648.2024.00022 (DOI) 2-s2.0-85212448438 (Scopus ID)
Conference
IEEE/SBC 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, Hawaii, November 13-15, 2024
Note

QC 20250113

Available from: 2023-05-08 Created: 2024-10-03 Last updated: 2025-01-13 Bibliographically reviewed
Identifiers
ORCID iD: orcid.org/0000-0003-4158-3583
