Publications (10 of 19)
Bouvry, P., Brorsson, M., Canal, R., Eftekhari, A., Höfinger, S., Smets, D., . . . Silvano, C. (2025). The European master for HPC curriculum. Journal of Parallel and Distributed Computing, 201, Article ID 105081.
2025 (English). In: Journal of Parallel and Distributed Computing, ISSN 0743-7315, E-ISSN 1096-0848, Vol. 201, article id 105081. Article in journal (Refereed). Published.
Abstract [en]

The use of High-Performance Computing (HPC) is crucial for addressing various grand challenges. While significant investments are made in digital infrastructures that comprise HPC resources, their realisation, operation and, in particular, use critically depend on suitably trained experts. In this paper, we present the results of an effort to design and implement a pan-European reference curriculum for a master's degree in HPC.

Keywords
Computing education, High-performance computing, Master in HPC, Model curricula
National Category
Other Physics Topics; Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:kth:diva-362519 (URN); 10.1016/j.jpdc.2025.105081 (DOI); 001466093600001 (); 2-s2.0-105001852460 (Scopus ID)
Note

QC 20250422

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-05-22. Bibliographically approved.
Pichetti, L., De Sensi, D., Sivalingam, K., Nassyr, S., Cesarini, D., Turisini, M., . . . Vella, F. (2024). Benchmarking Ethernet Interconnect for HPC/AI workloads. In: Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. Paper presented at 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024, Atlanta, United States of America, Nov 17 2024 - Nov 22 2024 (pp. 869-875). Institute of Electrical and Electronics Engineers (IEEE).
2024 (English). In: Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 869-875. Conference paper, Published paper (Refereed).
Abstract [en]

Interconnects have always played a cornerstone role in HPC. Since the inception of the Top500 ranking, interconnect statistics have been dominated by two competing technologies: InfiniBand and Ethernet. However, even if Ethernet is very popular due to its versatility and cost-effectiveness, InfiniBand used to provide higher bandwidth and continues to feature lower latency. Industry seeks a further evolution of the Ethernet standards to enable fast and low-latency interconnects for emerging AI workloads by offering competitive, open-standard solutions. This paper analyzes the early results obtained from two systems relying on an HPC Ethernet interconnect, one relying on 100G and the other on 200G Ethernet. Preliminary findings indicate that the Ethernet-based networks exhibit competitive performance, closely aligning with InfiniBand, especially for large message exchanges.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
ethernet, gigabit ethernet, hpc/ai workloads, infiniband, interconnect
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-360171 (URN); 10.1109/SCW63240.2024.00124 (DOI); 2-s2.0-85217184791 (Scopus ID)
Conference
2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024, Atlanta, United States of America, Nov 17 2024 - Nov 22 2024
Note

Part of ISBN 979-8-3503-5554-3

QC 20250224

Available from: 2025-02-19. Created: 2025-02-19. Last updated: 2025-03-24. Bibliographically approved.
Zaourar, L., Benazouz, M., Mouhagir, A., Falquez, C., Portero, A., Ho, N., . . . Pleiter, D. (2024). Case Studies on the Impact and Challenges of Heterogeneous NUMA Architectures for HPC. In: Architecture of Computing Systems - 37th International Conference, ARCS 2024, Proceedings. Paper presented at 37th International Conference on Architecture of Computing Systems, ARCS 2024, Potsdam, Germany, May 14-16, 2024 (pp. 251-265). Springer Nature.
2024 (English). In: Architecture of Computing Systems - 37th International Conference, ARCS 2024, Proceedings, Springer Nature, 2024, p. 251-265. Conference paper, Published paper (Refereed).
Abstract [en]

The memory systems of High-Performance Computing (HPC) systems commonly feature non-uniform data paths to memory, i.e. are non-uniform memory access (NUMA) architectures. Memory is divided into multiple regions, with each processing unit having its own local memory. Therefore, for each processing unit access to local memory regions is faster compared to accessing memory at non-local regions. Architectures with hybrid memory technologies result in further non-uniformity. This paper presents case studies of the performance potential and data placement implications of non-uniform and heterogeneous memory in HPC systems. Using the gem5 and VPSim simulation platforms, we model NUMA systems with processors based on the ARMv8 Neoverse V1 Reference Design. The gem5 simulator provides a cycle-accurate view, while VPSim offers greater simulation speed, with a high-level view of the simulated system. We highlight the performance impact of design trade-offs regarding NUMA node organization and System Level Cache (SLC) group assignment, as well as Network-on-Chip (NoC) configuration. Our case studies provide essential input to a co-design process involving HPC processor architects and system integrators. A comparison of system configurations for different NoC bandwidths shows reduced NoC latency and high memory bandwidth improvement when NUMA control is enabled. Furthermore, a configuration with HBM2 memory organized as four NUMA nodes highlights the memory bandwidth performance gap and NoC queuing latency impact when comparing local vs. remote memory accesses. On the other hand, NUMA can result in an unbalanced distribution of memory accesses and reduced SLC hit ratios, as shown with DDR4 memory organized as four NUMA nodes.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
benchmarking, co-design, High Performance Computing (HPC), Non-Uniform Memory Access (NUMA), simulation
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-352150 (URN); 10.1007/978-3-031-66146-4_17 (DOI); 001293533700017 (); 2-s2.0-85201001415 (Scopus ID)
Conference
37th International Conference on Architecture of Computing Systems, ARCS 2024, Potsdam, Germany, May 14-16, 2024
Note

Part of ISBN: 9783031661457

QC 20241004

Available from: 2024-08-22. Created: 2024-08-22. Last updated: 2024-10-04. Bibliographically approved.
Saglam, B., Ho, N., Falquez, C., Portero, A., Schätzle, F., Suarez, E. & Pleiter, D. (2024). Data Prefetching on Processors with Heterogeneous Memory. In: MEMSYS 2024 - Proceedings of the International Symposium on Memory Systems. Paper presented at 10th International Symposium on Memory Systems, MEMSYS 2024, Washington, United States of America, Sep 30 2024 - Oct 3 2024 (pp. 45-60). Association for Computing Machinery (ACM).
2024 (English). In: MEMSYS 2024 - Proceedings of the International Symposium on Memory Systems, Association for Computing Machinery (ACM), 2024, p. 45-60. Conference paper, Published paper (Refereed).
Abstract [en]

Heterogeneous memory architectures, such as a mix of High Bandwidth Memory (HBM) and Double Data Rate (DDR), offer flexible performance optimization by leveraging the high bandwidth of HBM along with the high capacity of DDR. However, these architectures present challenges in balancing bandwidth and capacity to maximize overall system performance and complicate hardware design. In a flat memory organization mixing HBM and DDR, prefetchers must carefully reduce prefetch requests on DDR when transitioning from HBM to avoid performance degradation due to potential bandwidth saturation. Traditional hardware prefetchers, which typically assume a homogeneous memory, are unaware of this circumstance, so they may not be effective in heterogeneous memory architectures. The paper enhances the aggressiveness of prefetchers in this kind of architecture. Our technique enables a prefetcher to dynamically determine the optimal prefetch degree and distance based on memory type. It balances prefetch aggressiveness and timeliness through an adaptive strategy informed by bandwidth utilization and prefetch metrics learned for each memory type. We evaluated the technique within the Stride and Stream Prefetchers at L2 in a gem5 model of a 20-core Arm Neoverse V1-like architecture, a mix of HBM2 and DDR5. The simulation results, focusing on scientific benchmarks, showed that the technique effectively guides prefetchers to near-optimal static configurations. On HBM2, the adaptation strategy detects bandwidth availability and prefetches more aggressively to boost performance, achieving speedups of 1.3× to 2.3×. On DDR5, when faced with saturated bandwidth contention, the adaptation strategy switches to conservative prefetching mode to mitigate performance degradation.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Hardware Prefetcher, Hybrid Memory, NUMA
National Category
Computer Systems; Computer Engineering
Identifiers
urn:nbn:se:kth:diva-359656 (URN); 10.1145/3695794.3695800 (DOI); 2-s2.0-85216078744 (Scopus ID)
Conference
10th International Symposium on Memory Systems, MEMSYS 2024, Washington, United States of America, Sep 30 2024 - Oct 3 2024
Note

Part of ISBN 9798400710919

QC 20250206

Available from: 2025-02-06. Created: 2025-02-06. Last updated: 2025-02-06. Bibliographically approved.
Nassyr, S. & Pleiter, D. (2024). Exploring Processor Micro-architectures Optimised for BLAS3 Micro-kernels. In: Euro-Par 2024: Parallel Processing - 30th European Conference on Parallel and Distributed Processing, Proceedings. Paper presented at 30th International Conference on Parallel and Distributed Computing, Euro-Par 2024, August 26-30, 2024, Madrid, Spain (pp. 47-61). Springer Nature.
2024 (English). In: Euro-Par 2024: Parallel Processing - 30th European Conference on Parallel and Distributed Processing, Proceedings, Springer Nature, 2024, p. 47-61. Conference paper, Published paper (Refereed).
Abstract [en]

Dense matrix-matrix operations are relevant for a broad range of numerical applications, e.g. for implementing deep neural networks. Past research has led to a good understanding of how these operations can be mapped in a generic manner on typical processor architectures with multiple cache levels such that near-optimal performance can be reached. However, while commonly used micro-architectures are typically suitable for such operations, their architectural parameters need to be suitably tuned. The performance of highly optimised implementations of these operations relies on micro-kernels that are often handwritten. Given the increased variety of instruction set architectures and SIMD instruction extensions, this becomes challenging. In this paper, we present and implement a methodology for an exhaustive exploration of a processor core micro-architecture design space based on gem5 simulations. Furthermore, we present a tool for generating efficiently vectorised code leveraging Arm’s SVE and RISC-V’s RVV instructions. It automates the generation of micro-kernels and therefore makes it possible to generate a large range of such kernels. The results provide insights to both micro-architecture architects and micro-kernel developers. The assembler generator is open-sourced and the simulation data is available as supplementary material.

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
assembly generator, dense matrix-matrix multiplication, gem5 simulations, Processor micro-architectures, SIMD/vector instructions
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-353525 (URN); 10.1007/978-3-031-69766-1_4 (DOI); 001308370400004 (); 2-s2.0-85202745849 (Scopus ID)
Conference
30th International Conference on Parallel and Distributed Computing, Euro-Par 2024, August 26-30, 2024, Madrid, Spain
Note

QC 20241023

Available from: 2024-09-19. Created: 2024-09-19. Last updated: 2024-10-23. Bibliographically approved.
Long, S., Pleiter, D., Patrascoiu, M., Padrin, C., Carpene, M., More, S. & Carpio, M. (2024). Integrating FTS in the Fenix HPC infrastructure. In: Espinal, X., DeVita, R., Laycock, P. & Shadura, O. (Ed.), 26th international conference on computing in high energy and nuclear physics, CHEP 2023. Paper presented at 26th International Conference on Computing in High Energy and Nuclear Physics (CHEP), May 08-12, 2023, Norfolk, VA, USA. EDP Sciences, 295, Article ID 01037.
2024 (English). In: 26th international conference on computing in high energy and nuclear physics, CHEP 2023 / [ed] Espinal, X., DeVita, R., Laycock, P. & Shadura, O., EDP Sciences, 2024, Vol. 295, article id 01037. Conference paper, Published paper (Refereed).
Abstract [en]

As compute requirements in experimental high-energy physics are expected to increase significantly, there is a need to leverage high-performance computing (HPC) resources. However, HPC systems are currently organised and operated in a way that makes this difficult. Here we focus on a specific e-infrastructure that incorporates HPC resources, namely Fenix, which is based on a consortium of 6 leading European supercomputing centres. Fenix was initiated through the Human Brain Project (HBP) but also provides resources to other research communities in Europe. The Fenix sites are integrated into a common AAI and provide a so-called Archival Data Repository that can be accessed through a Swift API. In this paper, we report on our efforts to realise a data transfer service that allows data to be exchanged with the Fenix e-infrastructure. This has been enabled by implementing support for Swift in FTS3 and related software components. We discuss, in particular, how FTS3 has been integrated into the Fenix AAI, which largely follows the architectural principles of the European Open Science Cloud (EOSC). Furthermore, we show how end-users can use this service through a WebFTS service that has been integrated into the science gateway of the HBP, which is also known as the HBP Collaboratory. Finally, we discuss how transfer commands can be automatically distributed over several FTS3 instances to optimise transfers between different Fenix sites.

Place, publisher, year, edition, pages
EDP Sciences, 2024
Series
EPJ Web of Conferences, ISSN 2100-014X
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-353118 (URN); 10.1051/epjconf/202429501037 (DOI); 001244151900037 (); 2-s2.0-85212211207 (Scopus ID)
Conference
26th International Conference on Computing in High Energy and Nuclear Physics (CHEP), May 08-12, 2023, Norfolk, VA, USA
Note

QC 20240912

Available from: 2024-09-12. Created: 2024-09-12. Last updated: 2025-01-07. Bibliographically approved.
Portero, A., Falquez, C., Ho, N., Petrakis, P., Nassyr, S., Marazakis, M., . . . Suarez, E. (2023). COMPESCE: A Co-design Approach for Memory Subsystem Performance Analysis in HPC Many-Cores. In: Architecture of Computing Systems: 36th International Conference, ARCS 2023, Proceedings. Paper presented at 36th International Conference on Architecture of Computing Systems, ARCS 2023, June 13-15, 2023, Athens, Greece (pp. 105-119). Springer Nature
2023 (English). In: Architecture of Computing Systems: 36th International Conference, ARCS 2023, Proceedings, Springer Nature, 2023, p. 105-119. Conference paper, Published paper (Refereed).
Abstract [en]

This paper explores memory subsystem design through gem5 simulations of a non-uniform memory access (NUMA) architecture with ARM cores that are equipped with vector engines and connected to a Network-on-Chip (NoC) following the Coherent Hub Interface (CHI) protocol. The study quantifies the benefits of vectorization, prefetching, and multichannel NoC configurations using a benchmark for generating memory patterns and indexed accesses. The outcomes provide insights into improving bus utilization and bandwidth and reducing stalls in the system. The paper proposes hardware/software (HW/SW) advancements to reach an HBM device utilization above 80% at the memory controllers in the simulated manycore system.

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Co-design, gem5, HPC, Network on Chip
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-337883 (URN); 10.1007/978-3-031-42785-5_8 (DOI); 001293532100008 (); 2-s2.0-85171444909 (Scopus ID)
Conference
36th International Conference on Architecture of Computing Systems, ARCS 2023, June 13-15, 2023, Athens, Greece
Note

Part of ISBN: 9783031427848

QC 20241004

Available from: 2023-10-10. Created: 2023-10-10. Last updated: 2024-10-04. Bibliographically approved.
Smail, R. E., Batelaan, M., Horsley, R., Nakamura, Y., Perlt, H., Pleiter, D., . . . Zanotti, J. M. (2023). Constraining beyond the standard model nucleon isovector charges. Physical Review D: covering particles, fields, gravitation, and cosmology, 108(9), Article ID 094511.
2023 (English). In: Physical Review D: covering particles, fields, gravitation, and cosmology, ISSN 2470-0010, E-ISSN 2470-0029, Vol. 108, no 9, article id 094511. Article in journal (Refereed). Published.
Abstract [en]

At the TeV scale, low-energy precision observations of neutron characteristics provide unique probes of novel physics. Precision studies of neutron decay observables are susceptible to beyond the Standard Model (BSM) tensor and scalar interactions, while the neutron electric dipole moment, $d_n$, also has high sensitivity to new BSM CP-violating interactions. To fully utilize the potential of future experimental neutron physics programs, matrix elements of appropriate low-energy effective operators within neutron states must be precisely calculated. We present results from the QCDSF/UKQCD/CSSM Collaboration for the isovector charges $g_T$, $g_A$ and $g_S$ of the nucleon, $\Sigma$ and $\Xi$ baryons using lattice QCD methods and the Feynman-Hellmann theorem. We use a flavor symmetry breaking method to systematically approach the physical quark mass using ensembles that span five lattice spacings and multiple volumes. We extend this existing flavor-breaking expansion to also account for lattice spacing and finite volume effects in order to quantify all systematic uncertainties. Our final estimates of the nucleon isovector charges are $g_T = 1.010(21)_{\mathrm{stat}}(12)_{\mathrm{sys}}$, $g_A = 1.253(63)_{\mathrm{stat}}(41)_{\mathrm{sys}}$ and $g_S = 1.08(21)_{\mathrm{stat}}(03)_{\mathrm{sys}}$, renormalized, where appropriate, at $\mu = 2$ GeV in the $\overline{\mathrm{MS}}$ scheme.

Place, publisher, year, edition, pages
American Physical Society (APS), 2023
National Category
Subatomic Physics
Identifiers
urn:nbn:se:kth:diva-340967 (URN); 10.1103/PhysRevD.108.094511 (DOI); 001119009700015 (); 2-s2.0-85178092762 (Scopus ID)
Note

QC 20231218

Available from: 2023-12-18. Created: 2023-12-18. Last updated: 2024-02-29. Bibliographically approved.
Brank, B. & Pleiter, D. (2023). CPU Architecture Modelling and Co-design. In: High Performance Computing - 38th International Conference, ISC High Performance 2023, Proceedings. Paper presented at 38th International Conference on High Performance Computing, ISC High Performance 2023, Hamburg, Germany, May 21 2023 - May 25 2023 (pp. 3-21). Springer Nature.
2023 (English). In: High Performance Computing - 38th International Conference, ISC High Performance 2023, Proceedings, Springer Nature, 2023, p. 3-21. Conference paper, Published paper (Refereed).
Abstract [en]

Co-design has become an established process for both developing high-performance computing (HPC) architectures (and, more specifically, CPU architectures) as well as HPC applications. The co-design process is frequently based on models. This paper discusses an approach to CPU architecture modelling and its relation to modelling theory. The approach is implemented using the gem5 simulator for Arm-based CPU architectures and applied for the purpose of generating co-design knowledge using two applications that are widely used on HPC systems.

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Arm, computer architecture modelling, computer architecture simulation, gem5, GPAW, Graviton 2, GROMACS, HPC applications, HPC architectures
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:kth:diva-338629 (URN); 10.1007/978-3-031-32041-5_1 (DOI); 2-s2.0-85161134699 (Scopus ID)
Conference
38th International Conference on High Performance Computing, ISC High Performance 2023, Hamburg, Germany, May 21 2023 - May 25 2023
Note

Part of ISBN 9783031320408

QC 20231102

Available from: 2023-11-02. Created: 2023-11-02. Last updated: 2023-11-02. Bibliographically approved.
Kunkel, J. M., Boehme, C., Decker, J., Magugliani, F., Pleiter, D., Koller, B., . . . Yaman, B. (2023). DECICE: Device-Edge-Cloud Intelligent Collaboration Framework. In: Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023. Paper presented at 20th ACM International Conference on Computing Frontiers, CF 2023, Bologna, Italy, May 9 2023 - May 11 2023 (pp. 266-271). Association for Computing Machinery (ACM).
2023 (English). In: Proceedings of the 20th ACM International Conference on Computing Frontiers 2023, CF 2023, Association for Computing Machinery (ACM), 2023, p. 266-271. Conference paper, Published paper (Refereed).
Abstract [en]

DECICE is a Horizon Europe project that is developing an AI-enabled, open and portable management framework for the automatic and adaptive optimization and deployment of applications in a computing continuum that ranges from IoT sensors on the Edge to large-scale Cloud/HPC computing infrastructures. In this paper, we describe the DECICE framework and architecture. Furthermore, we highlight use-cases for framework evaluation: intelligent traffic intersection, magnetic resonance imaging, and emergency response.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
AI-enabled Computing Continuum, Cloud-Edge Orchestration, Cognitive Cloud, Digital Twin, KubeEdge
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-336730 (URN); 10.1145/3587135.3592179 (DOI); 001116950900044 (); 2-s2.0-85169615382 (Scopus ID)
Conference
20th ACM International Conference on Computing Frontiers, CF 2023, Bologna, Italy, May 9 2023 - May 11 2023
Note

Part of ISBN 9798400701405

QC 20230919

Available from: 2023-09-19. Created: 2023-09-19. Last updated: 2024-03-12. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0001-7296-7817
