KTH Publications
Publications (10 of 19)
Araújo De Medeiros, D., Williams, J. J., Wahlgren, J., Saud Maia Leite, L. & Peng, I. B. (2025). ARC-V: Vertical Resource Adaptivity for HPC Workloads in Containerized Environments. In: 31st International European Conference on Parallel and Distributed Computing. Paper presented at The 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, 25-29 Aug, 2025. Springer Nature
2025 (English) In: 31st International European Conference on Parallel and Distributed Computing, Springer Nature, 2025. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Vertical scaling, HPC workloads, Cloud Computing, Resource Adaptivity, Memory Resource Provisioning
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363170 (URN) 10.1007/978-3-031-99854-6_12 (DOI) 2-s2.0-105015430232 (Scopus ID)
Conference
The 31st International European Conference on Parallel and Distributed Computing (Euro-Par ’25), Dresden, Germany, 25-29 Aug, 2025
Note

QC 20250923

Available from: 2025-05-06. Created: 2025-05-06. Last updated: 2025-09-23. Bibliographically approved
Jansson, N., Karp, M., Wahlgren, J., Markidis, S. & Schlatter, P. (2025). Design of Neko—A Scalable High‐Fidelity Simulation Framework With Extensive Accelerator Support. Concurrency and Computation, 37(2), Article ID e8340.
2025 (English) In: Concurrency and Computation, ISSN 1532-0626, E-ISSN 1532-0634, Vol. 37, no. 2, article id e8340. Article in journal (Refereed), Published
Abstract [en]

Recent trends toward more diverse and heterogeneous hardware in High-Performance Computing (HPC) are challenging scientific software developers in their pursuit of efficient numerical methods with sustained performance across a diverse set of platforms. As a result, researchers today are forced to refactor their codes to leverage these powerful new heterogeneous systems. We present our design considerations for Neko—a portable framework for high-fidelity spectral element flow simulations. Unlike prior work, Neko adopts a modern object-oriented Fortran 2008 approach, allowing multi-tier abstractions of the solver stack and facilitating various hardware backends, ranging from general-purpose processors and accelerators down to exotic vector processors and Field-Programmable Gate Arrays (FPGAs). Focusing on the performance and portability of Neko, we describe the framework's device abstraction layer, which manages device memory, data transfers, and kernel launches from Fortran, allowing the solver to be written in a hardware-neutral yet performant way. Accelerator-specific optimizations are also discussed, including auto-tuning of key kernels and various communication strategies using device-aware MPI. Finally, we present performance measurements on a wide range of computing platforms, including the EuroHPC pre-exascale system LUMI, where Neko achieves excellent parallel efficiency for a large direct numerical simulation (DNS) of turbulent fluid flow using up to 80% of the entire LUMI supercomputer.

Place, publisher, year, edition, pages
Wiley, 2025
National Category
Computational Mathematics; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358042 (URN) 10.1002/cpe.8340 (DOI) 001387473600001 () 2-s2.0-85213688601 (Scopus ID)
Funder
Swedish Research Council, 2019-04723; Swedish e-Science Research Center, SESSI; EU, Horizon Europe, 101093393
Note

QC 20250122

Available from: 2025-01-03. Created: 2025-01-03. Last updated: 2025-01-22. Bibliographically approved
Wahlgren, J., Schieffer, G., Shi, R., León, E. A., Pearce, R., Gokhale, M. & Peng, I. (2025). Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs. In: Proceedings - 2025 IEEE International Symposium on Workload Characterization, IISWC 2025. Paper presented at 28th IEEE International Symposium on Workload Characterization, IISWC 2025, Irvine, United States of America, October 12-14, 2025 (pp. 368-380). Institute of Electrical and Electronics Engineers Inc.
2025 (English) In: Proceedings - 2025 IEEE International Symposium on Workload Characterization, IISWC 2025, Institute of Electrical and Electronics Engineers Inc., 2025, p. 368-380. Conference paper, Published paper (Refereed)
Abstract [en]

Discrete GPUs are a cornerstone of HPC and data center systems, requiring management of separate CPU and GPU memory spaces. Unified Virtual Memory (UVM) has been proposed to ease the burden of memory management, but at a high cost in performance. The recent introduction of AMD's MI300A Accelerated Processing Units (APUs), as deployed in the El Capitan supercomputer, enables, for the first time, HPC systems featuring an integrated CPU and GPU with Unified Physical Memory (UPM). This work presents the first comprehensive characterization of the UPM architecture on MI300A. We first analyze the UPM system properties, including memory latency, bandwidth, and coherence overhead. We then assess the efficiency of the system software in memory allocation, page fault handling, TLB management, and Infinity Cache utilization. We propose a set of porting strategies for transforming applications for the UPM architecture and evaluate six applications on the MI300A APU. Our results show that applications on UPM using the unified memory model can match or outperform those using the explicitly managed model, while reducing memory costs by up to 44%.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2025
Keywords
GPU Memory Management, High Performance Computing (HPC), Memory System Characterization
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-377367 (URN) 10.1109/IISWC66894.2025.00038 (DOI) 2-s2.0-105029040045 (Scopus ID)
Conference
28th IEEE International Symposium on Workload Characterization, IISWC 2025, Irvine, United States of America, October 12-14, 2025
Note

Part of ISBN 9798331549176

QC 20260226

Available from: 2026-02-26. Created: 2026-02-26. Last updated: 2026-02-26. Bibliographically approved
Peng, I. B., Wahlgren, J., Youssef, K., Iwabuchi, K., Pearce, R. & Gokhale, M. (2025). UMap: An application-oriented user level memory mapping library. The international journal of high performance computing applications, 39(2), 269-282
2025 (English) In: The international journal of high performance computing applications, ISSN 1094-3420, E-ISSN 1741-2846, Vol. 39, no. 2, p. 269-282. Article in journal (Refereed), Published
Abstract [en]

Exploiting the prominent role of complex memories in exascale node architectures, the UMap page fault handler offers new capabilities to access large memory-mapped data sets directly. UMap provides flexible configuration options to customize page handling to each application, including analysis of massive observational and simulation data sets. Its high-performance design features I/O decoupling, dynamic load balancing, and application-level controls. Page faults triggered by application threads and processes accessing data mapped to a UMap'ed region are handled via the Linux userfaultfd protocol, an asynchronous message-oriented kernel-user communication mechanism that avoids the context-switch penalty of traditional signal-based fault handlers. UMap is fully open source. In this paper, we give an overview of the UMap library architecture, its extensible plugin architecture, and the use and performance of UMap in emerging heterogeneous memory hierarchies such as near-node Non-Volatile Memory (NVM) and network-attached memories. We highlight new capabilities in two page-fault management plugins, the NetworkStore and SparseStore. We demonstrate the integration between UMap and multiple ECP products, including Caliper, Metall, ZFP, Mochi, and Ripples.
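UMap's own C++ API is not quoted in this record; as a minimal sketch of the baseline behavior it generalizes (file pages faulted in lazily on first access, which UMap instead services in user space via userfaultfd), here is a standard memory-mapped read using Python's `mmap` module. The file name and offsets are invented for the example:

```python
import mmap
import os
import tempfile

# Stand-in for a large out-of-core data set: a file whose interesting
# bytes sit past the first page boundary.
path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096 + b"payload" + b"\x00" * 100)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Touching this range triggers a page fault that the kernel
        # services by reading the file; UMap intercepts such faults
        # with userfaultfd and applies per-application policies instead.
        chunk = m[4096:4103]

print(chunk)  # b'payload'
```

A userfaultfd-based handler such as UMap's keeps this same mapping interface, but decides in user space how each faulted page is filled (from SSD, network-attached memory, or a sparse store).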

Place, publisher, year, edition, pages
SAGE Publications, 2025
Keywords
Memory mapping, mmap, page fault, page management, user-space paging, userfaultfd
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-362548 (URN) 10.1177/10943420241303145 (DOI) 001370462800001 () 2-s2.0-105002054781 (Scopus ID)
Note

QC 20250422

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-04-22. Bibliographically approved
Araújo De Medeiros, D., Schieffer, G., Wahlgren, J. & Peng, I. B. (2025). Understanding Layered Portability from HPC to Cloud in Containerized Environments. In: Weiland, M., Neuwirth, S., Kruse, C. & Weinzierl, T. (Eds.), Proceedings High Performance Computing. ISC High Performance 2024 International Workshops. Paper presented at 39th International Conference of the ISC High Performance, May 12-16, 2024, Hamburg, Germany (pp. 439-452). Springer Nature, 15058
2025 (English) In: Proceedings High Performance Computing. ISC High Performance 2024 International Workshops / [ed] Weiland, M., Neuwirth, S., Kruse, C. & Weinzierl, T., Springer Nature, 2025, Vol. 15058, p. 439-452. Conference paper, Published paper (Refereed)
Abstract [en]

Recent developments in lightweight OS-level virtualization (containers) provide a potential solution for running HPC applications on cloud platforms. In this work, we focus on the impact of the different layers in a containerized environment when migrating HPC containers from a dedicated HPC system to a cloud platform. On three ARM-based platforms, including the latest Nvidia Grace CPU, we use six representative HPC applications to characterize the impact of container virtualization, the host OS and kernel, and rootless versus privileged container execution. Our results indicate less than 4% container overhead in DGEMM, miniMD, and XSBench, but 8%-10% overhead in FFT, HPCG, and Hypre. We also show that switching between the container execution modes results in negligible performance differences across the six applications.

Place, publisher, year, edition, pages
Springer Nature, 2025
Series
Lecture Notes in Computer Science, ISSN 0302-9743
Keywords
Cloud and HPC Convergence, Containers, ARM, Performance
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-366126 (URN) 10.1007/978-3-031-73716-9_31 (DOI) 001463189500031 () 2-s2.0-105009319340 (Scopus ID)
Conference
39th International Conference of the ISC High Performance, May 12-16, 2024, Hamburg, Germany
Note

Part of ISBN 978-3-031-73715-2, 978-3-031-73716-9

QC 20250703

Available from: 2025-07-03. Created: 2025-07-03. Last updated: 2025-07-10. Bibliographically approved
Schieffer, G., Pornthisan, N., Araújo De Medeiros, D., Markidis, S., Wahlgren, J. & Peng, I. B. (2024). Boosting the Performance of Object Tracking with a Half-Precision Particle Filter on GPU. In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 - September 1, 2023, Revised Selected Papers. Paper presented at International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Aug 28 - Sep 1, 2023, Limassol, Cyprus (pp. 294-305). Springer Nature
2024 (English) In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, Limassol, Cyprus, August 28 - September 1, 2023, Revised Selected Papers, Springer Nature, 2024, p. 294-305. Conference paper, Published paper (Refereed)
Abstract [en]

High-performance GPU-accelerated particle filter methods are critical for object detection applications, ranging from autonomous driving and robot localization to time-series prediction. In this work, we investigate the design, development, and optimization of particle filters using half precision on CUDA cores and compare their performance and accuracy with single- and double-precision baselines on Nvidia V100, A100, A40, and T4 GPUs. To mitigate numerical instability and precision losses, we introduce algorithmic changes to the particle filters. Using half precision leads to a performance improvement of 1.5–2× and 2.5–4.6× with respect to the single- and double-precision baselines, respectively, at the cost of a relatively small loss of accuracy.
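The paper's CUDA kernels are not reproduced in this record; as a minimal NumPy sketch (sizes and values invented) of the numerical hazard that motivates such algorithmic changes: naively accumulating particle weights in half precision stalls once the running sum outgrows float16's resolution, while accumulating in a wider type and only storing in half precision does not:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.random(10_000)            # particle weights in [0, 1)
w16 = w.astype(np.float16)

# Naive running sum kept entirely in half precision: once the sum
# passes 2048, the float16 spacing is 2.0, so adding any weight < 1
# rounds away to nothing and the sum stalls.
s16 = np.float16(0.0)
for x in w16:
    s16 = np.float16(s16 + x)

# Mixed precision: accumulate in float32, store results in float16.
s32 = np.float32(0.0)
for x in w16:
    s32 = np.float32(s32 + x)

print(float(s16))   # stalls near 2048, far from the true sum (~5000)
print(float(s32))   # close to the true sum
```

The paper's actual mitigations on CUDA cores may differ; the sketch only shows why accumulation order and precision have to be chosen carefully when moving particle-filter arithmetic (e.g. weight normalization) to half precision.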

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
GPUs, Half-Precision, Particle Filter, Reduced Precision
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-346540 (URN) 10.1007/978-3-031-50684-0_23 (DOI) 001279250600023 () 2-s2.0-85192268315 (Scopus ID)
Conference
International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Aug 28 - Sep 1, 2023, Limassol, Cyprus
Note

Part of proceedings ISBN: 978-3-031-50683-3

QC 20240520

Available from: 2024-05-16. Created: 2024-05-16. Last updated: 2024-09-10. Bibliographically approved
Wahlgren, J., Schieffer, G., Gokhale, M., Pearce, R. & Peng, I. (2024). Disaggregated Memory with SmartNIC Offloading: a Case Study on Graph Processing. Paper presented at IEEE/SBC 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, Hawaii, November 13-15, 2024. Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Disaggregated memory breaks the boundary of monolithic servers to enable memory provisioning on demand. Using network-attached memory to provide memory expansion for memory-intensive applications on compute nodes can improve the overall memory utilization of a cluster and reduce the total cost of ownership. However, current software solutions for leveraging network-attached memory must consume resources on the compute node for memory management tasks. Emerging off-path SmartNICs provide general-purpose programmability on low-cost, low-power cores. This work provides a general architecture design that enables network-attached memory and offloads memory management tasks onto an off-path programmable SmartNIC. We provide a prototype implementation called SODA on the Nvidia BlueField DPU. SODA adapts communication paths and data transfer alternatives, pipelines data movement stages, and enables customizable data caching and prefetching optimizations. We evaluate SODA on five representative graph applications using real-world graphs. Our results show that SODA can achieve up to 7.9x speedup compared to node-local SSD and reduce network traffic by 42% compared to disaggregated memory without SmartNIC offloading, at similar or better performance.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
SmartNIC, disaggregated memory, fabric-attached memory
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-326630 (URN) 10.1109/SBAC-PAD63648.2024.00022 (DOI) 2-s2.0-85212448438 (Scopus ID)
Conference
IEEE/SBC 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, Hawaii, November 13-15, 2024
Note

QC 20250113

Available from: 2023-05-08. Created: 2024-10-03. Last updated: 2025-01-13. Bibliographically approved
Schieffer, G., Wahlgren, J., Ren, J., Faj, J. & Peng, I. B. (2024). Harnessing Integrated CPU-GPU System Memory for HPC: A first look into Grace Hopper. In: 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings. Paper presented at 53rd International Conference on Parallel Processing, ICPP 2024, Gotland, Sweden, Aug 12-15, 2024 (pp. 199-209). Association for Computing Machinery (ACM)
2024 (English) In: 53rd International Conference on Parallel Processing, ICPP 2024 - Main Conference Proceedings, Association for Computing Machinery (ACM), 2024, p. 199-209. Conference paper, Published paper (Refereed)
Abstract [en]

Memory management across discrete CPU and GPU physical memories is traditionally achieved through explicit GPU allocations and data copies, or through unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system-allocated memory, and the cache-coherent NVLink-C2C interconnect, bringing an alternative solution for enabling a unified memory system. In this work, we provide the first in-depth study of system memory management on the Grace Hopper Superchip, in both in-memory and memory-oversubscription scenarios. We provide a suite of six representative applications, including the Qiskit quantum computing simulator, using system memory and managed memory. Using our memory utilization profiler and hardware counters, we quantify and characterize the impact of the integrated CPU-GPU system page table on GPU applications. Our study focuses on the first-touch policy, page table entry initialization, page sizes, and page migration. We identify practical optimization strategies for different access patterns. Our results show that, as a new solution for unified memory, system-allocated memory can benefit most use cases with minimal porting effort.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Grace Hopper, heterogeneous memory, NVLink, NVLink-C2C, unified memory
National Category
Computer Systems; Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-353543 (URN) 10.1145/3673038.3673110 (DOI) 001323772600020 () 2-s2.0-85202442936 (Scopus ID)
Conference
53rd International Conference on Parallel Processing, ICPP 2024, Gotland, Sweden, Aug 12-15, 2024
Note

Part of ISBN 9798400708428

QC 20240919

Available from: 2024-09-19. Created: 2024-09-19. Last updated: 2024-11-05. Bibliographically approved
Miksits, S., Shi, R., Gokhale, M., Wahlgren, J., Schieffer, G. & Peng, I. B. (2024). Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE. In: Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. Paper presented at 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024, Atlanta, United States of America, Nov 17-22, 2024 (pp. 996-1005). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) In: Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 996-1005. Conference paper, Published paper (Refereed)
Abstract [en]

High-end ARM processors are emerging in data centers and HPC systems, posing a strong challenge to x86 machines. Memory-centric profiling is an important approach for dissecting an application's memory-access bottlenecks and guiding optimizations. Many existing memory profiling tools leverage hardware performance counters and precise event sampling, such as Intel PEBS and AMD IBS, to achieve high accuracy and low overhead. In this work, we present a multi-level memory profiling tool for ARM processors, leveraging the Statistical Profiling Extension (SPE). We evaluate the tool using both HPC and cloud workloads on an ARM Ampere processor. Our results provide the first quantitative assessment of the time overhead and sampling accuracy of ARM SPE for memory-centric profiling at different sampling periods and aux buffer sizes.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
ARM SPE, memory profiling, precise event sampling
National Category
Computer Systems; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-360172 (URN) 10.1109/SCW63240.2024.00139 (DOI) 001451792300112 () 2-s2.0-85217180414 (Scopus ID)
Conference
2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024, Atlanta, United States of America, Nov 17-22, 2024
Note

Part of ISBN 979-8-3503-5554-3

QC 20250224

Available from: 2025-02-19. Created: 2025-02-19. Last updated: 2025-09-24. Bibliographically approved
Peng, I. B., Schulz, M., Haus, U. U., Prunty, C., Marcuello, P., Danovaro, E., . . . Markidis, S. (2024). OpenCUBE: Building an Open Source Cloud Blueprint with EPI Systems. In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, 2023, Revised Selected Papers. Paper presented at International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Limassol, Cyprus, Aug 28 2023 - Sep 1 2023 (pp. 260-264). Springer Nature
2024 (English) In: Euro-Par 2023: Parallel Processing Workshops - Euro-Par 2023 International Workshops, 2023, Revised Selected Papers, Springer Nature, 2024, p. 260-264. Conference paper, Published paper (Refereed)
Abstract [en]

OpenCUBE aims to develop a full open-source software stack for a cloud computing blueprint deployed on European Processor Initiative (EPI) hardware, adaptable to emerging workloads across the computing continuum. OpenCUBE prioritizes energy awareness and utilizes open APIs, open-source components, advanced SiPearl Rhea processors, and RISC-V accelerators. The project leverages representative workloads, such as cloud-native workloads and workflows for weather forecast data management, molecular docking, and space weather, for evaluation and validation.

Place, publisher, year, edition, pages
Springer Nature, 2024
Series
Lecture Notes in Computer Science
Keywords
Computing continuum, Converged HPC and Cloud, EPI, Open-source, RISC-V
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-346145 (URN) 10.1007/978-3-031-48803-0_29 (DOI) 001279248600029 () 2-s2.0-85190984237 (Scopus ID)
Conference
International workshops held at the 29th International Conference on Parallel and Distributed Computing, Euro-Par 2023, Limassol, Cyprus, Aug 28 - Sep 1, 2023
Note

Part of proceedings ISBN: 978-3-031-48802-3

QC 20240506

Available from: 2024-05-03. Created: 2024-05-03. Last updated: 2024-09-10. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1669-7714
