Online Policy Adaptation for Networked Systems using Rollout
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. ORCID iD: 0000-0002-6343-7416
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. ORCID iD: 0000-0003-1773-8354
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. ORCID iD: 0000-0001-6039-8493
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic resource allocation in networked systems is needed to continuously achieve end-to-end management objectives. Recent research has shown that reinforcement learning can achieve near-optimal resource allocation policies for realistic system configurations. However, most current solutions require expensive retraining when changes in the system occur. We address this problem and introduce an efficient method to adapt a given base policy to system changes, e.g., to a change in the service offering. In our approach, we adapt a base control policy using a rollout mechanism, which transforms the base policy into an improved rollout policy. We perform extensive evaluations on a testbed where we run applications on a service mesh based on the Istio and Kubernetes platforms. The experiments provide insights into the performance of different rollout algorithms. We find that our approach produces policies that are equally effective as those obtained by offline retraining. On our testbed, effective policy adaptation takes seconds when using rollout, compared to minutes or hours when using retraining. Our work demonstrates that rollout, which has been applied successfully in other domains, is an effective approach for policy adaptation in networked systems.
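The rollout mechanism described in the abstract can be sketched as follows. This is a minimal illustration of the generic rollout idea (one-step lookahead with the base policy estimating the tail return); the `simulate` transition function and `base_policy` below are hypothetical placeholders, not the paper's actual interfaces.

```python
def rollout_policy(state, actions, simulate, base_policy, horizon=10, gamma=0.99):
    """One-step lookahead rollout: for each candidate action, simulate one
    transition, then follow the base policy for `horizon` steps and accumulate
    discounted reward. The action with the highest estimated return is chosen,
    which improves on the base policy without any retraining."""
    best_action, best_value = None, float("-inf")
    for a in actions:
        s, r = simulate(state, a)          # one simulated transition
        value, discount = r, gamma
        for _ in range(horizon):           # tail estimated with the base policy
            s, r = simulate(s, base_policy(s))
            value += discount * r
            discount *= gamma
        if value > best_value:
            best_action, best_value = a, value
    return best_action
```

Because each candidate action is evaluated against a simulator rather than the live system, the adaptation cost is a handful of short simulations per decision, which is consistent with the seconds-versus-hours comparison reported above.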

Place, publisher, year, edition, pages
2024.
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-346582
OAI: oai:DiVA.org:kth-346582
DiVA, id: diva2:1858750
Conference
IEEE/IFIP Network Operations and Management Symposium, 6–10 May 2024, Seoul, South Korea
Note

QC 20240522

Available from: 2024-05-18. Created: 2024-05-18. Last updated: 2024-06-10. Bibliographically approved.
In thesis
1. End-to-end performance prediction and automated resource management of cloud services
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud-based services are integral to modern life. Cloud systems aim to provide customers with uninterrupted services of high quality while enabling cost-effective fulfillment by providers. The key to meeting quality requirements and end-to-end performance objectives is to devise effective strategies for allocating resources to the services. This in turn requires automation of resource allocation. Recently, researchers have studied learning-based approaches, especially reinforcement learning (RL), for automated resource allocation. These approaches are particularly promising for resource allocation in cloud systems, as they can deal with the architectural complexity of a cloud environment. Previous research shows that reinforcement learning is effective for specific types of controls, such as horizontal or vertical scaling of compute resources. Major obstacles to operational deployment remain, however. Chief among them is the fact that reinforcement learning methods require long times for training and for retraining after system changes.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation using reinforcement learning on a testbed. On the conceptual level, we address two interconnected problems: predicting end-to-end service performance and automated resource allocation for cloud services. First, we study methods to predict the conditional density of service metrics and demonstrate the effectiveness of employing dimensionality reduction methods to reduce monitoring, communication, and model-training overhead. For automated resource allocation, we develop a framework for RL-based control. Our approach involves learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that using our framework, we can effectively achieve end-to-end performance objectives by dynamically allocating resources to the services using different types of control actions simultaneously.
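The first conceptual step above, predicting the conditional density of service metrics from reduced-dimension measurements, can be sketched as follows. This is a minimal sketch under simplifying assumptions (PCA for dimensionality reduction and a linear-Gaussian conditional model); the thesis's actual prediction models may differ, and `fit_conditional_density` is a hypothetical name.

```python
import numpy as np

def fit_conditional_density(X, y, k=2):
    """Reduce the feature (infrastructure-metric) dimension with PCA via SVD,
    then fit a linear-Gaussian model p(y | x) = N(w.z + b, sigma^2) in the
    reduced space. Returns a function mapping a new sample to (mean, std).
    Reducing from many raw metrics to k components cuts monitoring and
    model-training overhead, as the abstract describes."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:k].T                                  # top-k principal directions
    Z = Xc @ W                                    # reduced features
    A = np.column_stack([Z, np.ones(len(Z))])     # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
    sigma = (y - A @ coef).std()                  # residual spread = density width
    def predict(x):
        z = (x - mu) @ W
        return z @ coef[:-1] + coef[-1], sigma
    return predict
```

A predicted (mean, std) pair gives the controller not just an expected service metric but a confidence band, which is what makes conditional-density prediction more useful than point estimates for management decisions.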

Abstract [sv] (translated from Swedish)

Cloud-based services are integral to modern life. Cloud systems can offer uninterrupted services of high quality while enabling cost-effective implementation and operation. The key to meeting quality requirements and performance objectives for cloud services is to develop effective strategies for allocating resources to the services. This in turn requires automation, which is particularly important in shared infrastructures that host different services. Current research in the area studies methods based on reinforcement learning (RL) for automated resource allocation. These methods are particularly well suited to cloud environments because they can handle the architectural complexity typical of such environments. Results from previous research show that RL is an effective method for specific types of control actions, such as horizontal or vertical scaling of compute resources. Important challenges for deploying RL in operational systems remain, however. Among them is the fact that RL requires long optimization times and that the optimization must be redone at every system change.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation with RL on a testbed. On the conceptual level, we address two interconnected problems: predicting performance metrics and automated resource allocation for cloud services. First, we study methods for predicting the conditional probability of performance metrics and demonstrate the effectiveness of using dimensionality reduction to lower the cost of model training. We then develop a framework for automated resource allocation based on RL. Our framework includes learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that, using our framework, we can effectively achieve performance objectives by automatically executing control actions to dynamically allocate resources to the services.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2024
Series
TRITA-EECS-AVL ; 2024:42
Keywords
Network management automation, performance management, reinforcement learning, multi-dimensional control, online policy adaptation, end-to-end quality of service estimation.
National Category
Computer Systems
Research subject
Electrical Engineering
Identifiers
URN: urn:nbn:se:kth:diva-346585
ISBN: 978-91-8040-917-9
Public defence
2024-06-10, Q2, Malvinas väg 10, STOCKHOLM, 10:00 (English)
Opponent
Supervisors
Available from: 2024-05-22. Created: 2024-05-18. Last updated: 2024-06-10. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Shahabsamani, Forough; Hammar, Kim; Stadler, Rolf

