KTH Publications (kth.se)
Kub: Enabling Elastic HPC Workloads on Containerized Environments
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-1434-3042
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0003-1669-7714
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0009-0003-6504-7109
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0003-4158-3583
2023 (English). In: Proceedings of the 35th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

The conventional model of resource allocation in HPC systems is static. Thus, a job cannot leverage newly available resources in the system or release underutilized resources during the execution. In this paper, we present Kub, a methodology that enables elastic execution of HPC workloads on Kubernetes so that the resources allocated to a job can be dynamically scaled during the execution. One main optimization of our method is to maximize the reuse of the originally allocated resources so that the disruption to the running job can be minimized. The scaling procedure is coordinated among nodes through remote procedure calls on Kubernetes for deploying workloads in the cloud. We evaluate our approach using one synthetic benchmark and two production-level MPI-based HPC applications - GROMACS and CM1. Our results demonstrate that the benefits of adapting the allocated resources depend on the workload characteristics. In the tested cases, a properly chosen scaling point for increasing resources during execution achieved up to 2x speedup. Also, the overhead of checkpointing and data reshuffling significantly influences the selection of optimal scaling points and requires application-specific knowledge.
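The abstract's observation that the payoff from adding resources mid-run depends on when scaling happens (because checkpointing and data reshuffling cost time) can be sketched with a simple analytical model. This is a hypothetical illustration, not the paper's actual cost model; the function name, parameters, and all numbers below are assumptions:

```python
def completion_time(total_work, rate, scale_point, speedup, overhead):
    """Model job completion time when extra resources arrive mid-run.

    total_work  -- work units in the job
    rate        -- work units per second on the original allocation
    scale_point -- seconds into the run when resources are added
    speedup     -- factor by which the rate improves after scaling
    overhead    -- seconds lost to checkpointing and data reshuffling
    """
    done_before = min(rate * scale_point, total_work)
    remaining = total_work - done_before
    if remaining <= 0:  # job finishes before the scaling point is reached
        return total_work / rate
    return scale_point + overhead + remaining / (rate * speedup)

# Static allocation baseline: 1000 units at 1 unit/s -> 1000 s.
baseline = completion_time(1000, 1.0, float("inf"), 2.0, 30.0)

# Scaling early amortizes the 30 s overhead over more remaining work
# than scaling late, so the same resources yield a larger speedup.
early = completion_time(1000, 1.0, 100.0, 2.0, 30.0)  # 100 + 30 + 450 = 580 s
late = completion_time(1000, 1.0, 800.0, 2.0, 30.0)   # 800 + 30 + 100 = 930 s
```

Under these assumed numbers, scaling at 100 s finishes in 580 s versus 930 s when scaling at 800 s, mirroring the paper's point that a well-chosen scaling point matters and that the checkpoint/reshuffle overhead shapes where that point lies.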

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023.
Keywords [en]
HPC, Cloud, scaling, Kubernetes, Elasticity, Malleability
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-339917
DOI: 10.1109/SBAC-PAD59825.2023.00031
ISI: 001103378300022
Scopus ID: 2-s2.0-85178503556
OAI: oai:DiVA.org:kth-339917
DiVA, id: diva2:1813759
Conference
35th IEEE International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2023, Porto Alegre, Brazil, October 17-20, 2023
Funder
European Commission
Note

Part of ISBN 979-8-3503-0548-7

QC 20231122

Available from: 2023-11-21. Created: 2023-11-21. Last updated: 2025-12-05. Bibliographically approved.
In thesis
1. Emerging Paradigms in the Convergence of Cloud and High-Performance Computing
2023 (English). Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

Traditional HPC scientific workloads are tightly coupled, while emerging scientific workflows exhibit even more complex patterns, consisting of multiple characteristically different stages that may be IO-intensive, compute-intensive, or memory-intensive. New high-performance computer systems are evolving to adapt to these new requirements and are motivated by the need for performance and efficiency in resource usage. On the other hand, cloud workloads are loosely coupled, and their systems have matured technologies under different constraints from HPC.

In this thesis, the use of cloud technologies designed for loosely coupled dynamic and elastic workloads is explored, repurposed, and examined in the landscape of HPC in three major parts. The first part deals with the deployment of HPC workloads in cloud-native environments through the use of containers and analyses the feasibility and trade-offs of elastic scaling. The second part relates to the use of workflow management systems in HPC workflows; in particular, a molecular docking workflow executed through Airflow is discussed. Finally, object storage systems, a cost-effective and scalable solution widely used in the cloud, and their usage in HPC applications through MPI I/O are discussed in the third part of this thesis. 

Abstract [sv]

Emerging scientific applications are highly data-intensive and tightly coupled. New high-performance computer systems are adapting to these new requirements, motivated by the need for performance and efficiency in resource usage. Cloud applications, on the other hand, are loosely coupled, and their systems have mature technologies that evolved under constraints different from those of HPC.

In this thesis, the use of cloud technologies that have matured around loosely coupled applications is discussed in the HPC landscape in three main parts. The first part deals with the deployment of HPC applications in cloud environments through the use of containers and analyses the feasibility and trade-offs of elastic scaling. The second part concerns the use of workflow management systems in HPC workflows; in particular, a molecular docking workflow executed through Airflow is discussed. Object storage systems and their use in HPC, together with an interface between the S3 standard and MPI I/O, are discussed in the third part of this thesis.

Place, publisher, year, edition, pages
Stockholm, Sweden: KTH Royal Institute of Technology, 2023. p. v, 53
Series
TRITA-EECS-AVL ; 2023:80
Keywords
High-performance computing, Kubernetes, airflow, elastic scaling, MPI, S3
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-339918 (URN)
978-91-8040-753-3 (ISBN)
Presentation
2023-12-15, Visualization Studio, Lindstedtsvägen 9, Stockholm, 10:00 (English)
Opponent
Supervisors
Note

QC 20231122

Available from: 2023-11-22. Created: 2023-11-21. Last updated: 2023-11-22. Bibliographically approved.
2. Towards Adaptive Resource Management for HPC Workloads in Cloud Environments
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Maximizing resource efficiency is crucial when designing cloud-based systems, which are primarily built to meet specific quality-of-service requirements. Common optimization techniques include containerization, workflow orchestration, elasticity, and vertical scaling, all aimed at improving resource utilization and reducing costs. In contrast, on-premises high-performance computing systems prioritize maximum performance, typically relying on static resource allocation. While this approach offers certain advantages over cloud systems, it can be restrictive in handling the increasingly dynamic resource demands of tightly coupled HPC workloads, making adaptive resource management challenging.

This thesis explores the execution of high-performance workloads in cloud-based environments, investigating both horizontal and vertical scaling strategies as well as the feasibility of running HPC workflows in the cloud. Additionally, we will evaluate the costs of deploying these workloads in containerized environments and examine the advantages of using object storage in cloud-based HPC systems.

Abstract [sv]

Maximizing resource efficiency is crucial when designing cloud-based systems, which are primarily built to meet specific quality-of-service requirements. Common optimization techniques include containerization, workflow orchestration, elasticity, and vertical scaling, with the goal of improving resource utilization and reducing costs. In contrast, on-premises high-performance computing (HPC) systems focus on maximum performance and usually rely on static resource allocation. Although this strategy has certain advantages over cloud solutions, it can be restrictive in handling the increasingly dynamic resource demands of tightly coupled HPC workloads, making adaptive resource management challenging. This thesis investigates the execution of high-performance workloads in cloud-based environments, focusing on both horizontal and vertical scaling as well as the feasibility of running HPC workflows in the cloud. In addition, we will analyse the costs of deploying these workloads in containerized environments and evaluate the advantages of using object storage in cloud-based HPC systems.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2025. p. 91
Series
TRITA-EECS-AVL ; 2025:51
Keywords
high-performance computing, resource adaptability, cloud computing, containers, horizontal scaling, vertical scaling, object storage, Högprestandaberäkning, resursanpassningsförmåga, molnberäkning, containerisering, horisontell skalning, vertikal skalning, objektlagring
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363164 (URN)
978-91-8106-279-3 (ISBN)
Public defence
2025-06-02, E2, Lindstedtsvägen 3, Stockholm, 14:00 (English)
Opponent
Supervisors
Note

QC 20250506

Available from: 2025-05-06. Created: 2025-05-06. Last updated: 2025-05-06. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text | Scopus

Authority records

Araújo De Medeiros, Daniel; Wahlgren, Jacob; Schieffer, Gabin; Peng, Ivy Bo
