A Framework for dynamically meeting performance objectives on a service mesh
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. ORCID iD: 0000-0002-6343-7416
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. ORCID iD: 0000-0001-6039-8493
2024 (English). Manuscript (preprint) (Other academic)
Abstract [en]

We present a framework for achieving end-to-end management objectives for multiple services that concurrently execute on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework using a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. It is novel in three ways. First, it advocates a top-down approach whereby the management objective is defined first and then mapped onto the available control actions; several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates learning of the system model and the operating region from learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives in parallel. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% less system resources than Kubernetes HPA autoscaling.
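The separation the abstract describes — learn a system model from traces first, then train the control policy entirely in a simulator — can be sketched as follows. This is a minimal, hypothetical Python sketch: the delay model, action set, reward shape, and load levels are illustrative stand-ins, not the components used in the paper.

```python
import random

# Hypothetical system model "learned" from testbed traces. In the framework,
# this would be a model fit to measurements; here it is a toy function
# mapping (CPU allocation, offered load) to an end-to-end delay.
def simulated_delay(cpu_share: float, load: float) -> float:
    return load / max(cpu_share, 0.1)

DELAY_BOUND = 2.0                      # management objective: delay bound
ACTIONS = [0.2, 0.4, 0.6, 0.8, 1.0]    # candidate CPU allocations

def reward(cpu_share: float, load: float) -> float:
    # Meet the delay bound first; among feasible allocations, prefer cheap ones.
    ok = simulated_delay(cpu_share, load) <= DELAY_BOUND
    return (1.0 if ok else -1.0) - 0.5 * cpu_share

def train_policy(loads, episodes=4000, epsilon=0.2, alpha=0.5, seed=0):
    # Epsilon-greedy bandit learning entirely inside the simulator: the agent
    # never touches the (slow, expensive) testbed during training, which is
    # what shortens training time by orders of magnitude.
    rng = random.Random(seed)
    q = {(l, a): 0.0 for l in loads for a in ACTIONS}
    for _ in range(episodes):
        load = rng.choice(loads)
        if rng.random() < epsilon:
            a = rng.choice(ACTIONS)                      # explore
        else:
            a = max(ACTIONS, key=lambda x: q[(load, x)])  # exploit
        q[(load, a)] += alpha * (reward(a, load) - q[(load, a)])
    # Greedy policy: per load level, the action with the highest value.
    return {l: max(ACTIONS, key=lambda a: q[(l, a)]) for l in loads}

policy = train_policy(loads=[0.5, 1.0, 1.5])
```

Because the objective is defined first and only then mapped onto actions, swapping in a throughput or cost objective only changes `reward`, not the training loop — the top-down approach the abstract refers to.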

Place, publisher, year, edition, pages
2024.
National Category
Engineering and Technology, Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-346583
OAI: oai:DiVA.org:kth-346583
DiVA, id: diva2:1858751
Note

QC 20240522

Available from: 2024-05-18. Created: 2024-05-18. Last updated: 2024-05-22. Bibliographically approved.
In thesis
1. End-to-end performance prediction and automated resource management of cloud services
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud-based services are integral to modern life. Cloud systems aim to provide customers with uninterrupted services of high quality while enabling cost-effective fulfillment by providers. The key to meeting quality requirements and end-to-end performance objectives is to devise effective strategies for allocating resources to the services. This in turn requires automation of resource allocation. Recently, researchers have studied learning-based approaches, especially reinforcement learning (RL), for automated resource allocation. These approaches are particularly promising for resource allocation in cloud systems, as they can cope with the architectural complexity of a cloud environment. Previous research shows that reinforcement learning is effective for specific types of controls, such as horizontal or vertical scaling of compute resources. Major obstacles to operational deployment remain, however. Chief among them is the fact that reinforcement learning methods require long times for training and retraining after system changes.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation using reinforcement learning on a testbed. On the conceptual level, we address two interconnected problems: predicting end-to-end service performance and automated resource allocation for cloud services. First, we study methods to predict the conditional density of service metrics and demonstrate the effectiveness of employing dimensionality reduction methods to reduce monitoring, communication, and model-training overhead. For automated resource allocation, we develop a framework for RL-based control. Our approach involves learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that using our framework, we can effectively achieve end-to-end performance objectives by dynamically allocating resources to the services using different types of control actions simultaneously.
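The rollout mechanism mentioned above — adapting an offline-trained policy online — can be sketched roughly as follows: before each control action, candidate actions are evaluated by simulating a short horizon with the learned system model, following the base policy thereafter, and the candidate with the best simulated return is executed. This is a hypothetical sketch under toy assumptions; the delay model, cost function, and action set are illustrative stand-ins, not the thesis' actual components.

```python
TARGET_DELAY = 2.0  # illustrative performance objective: track this delay

def model_step(load: float, cpu: float, action: float):
    """Toy stand-in for the learned system model: apply a CPU adjustment and
    return the new allocation plus a one-step cost (delay tracking error
    plus a resource-usage penalty)."""
    cpu = min(1.0, max(0.1, cpu + action))
    delay = load / cpu
    cost = abs(delay - TARGET_DELAY) + 0.2 * cpu
    return cpu, cost

def base_policy(load: float, cpu: float) -> float:
    # Stand-in for the policy trained offline in the simulator:
    # keep the current allocation unchanged.
    return 0.0

def rollout(load, cpu, actions=(-0.1, 0.0, 0.1), horizon=4):
    """One-step lookahead: try each first action, simulate the remaining
    steps with the base policy, and return the first action whose simulated
    trajectory has the lowest total cost."""
    def total_cost(first_action):
        c, total, a = cpu, 0.0, first_action
        for _ in range(horizon):
            c, step_cost = model_step(load, c, a)
            total += step_cost
            a = base_policy(load, c)  # tail of the trajectory: base policy
        return total
    return min(actions, key=total_cost)
```

The design point is that the expensive part (learning the model and the base policy) happens offline, while the cheap lookahead runs online at each control step — e.g., `rollout(1.0, 0.4)` scales up when the service is under-provisioned, and `rollout(0.5, 0.4)` scales down when it is over-provisioned.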

Abstract [sv] (translated from Swedish)

Cloud-based services are integral to modern life. Cloud systems can offer uninterrupted services of high quality while enabling cost-effective deployment and operation. The key to meeting quality requirements and performance objectives for cloud services is to develop effective strategies for allocating resources to the services. This in turn requires automation, which is particularly important in shared infrastructures that host different services. Current research in this area studies methods based on reinforcement learning (RL) for automatic resource allocation. These methods are particularly well suited to cloud environments because they can handle the architectural complexity typical of such environments. Results from previous research show that RL is an effective method for specific types of control actions, such as horizontal or vertical scaling of compute resources. Important challenges for deploying RL in operational systems remain, however. Among them is the fact that RL requires long optimization times and that the optimization must be redone after every system change.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation with RL on a testbed. On a conceptual level, we address two interconnected problems: predicting performance metrics and automated resource allocation for cloud services. First, we study methods for predicting the conditional probability of performance metrics and demonstrate the effectiveness of using dimensionality reduction to lower the cost of model training. We then develop a framework for automated resource allocation based on RL. Our framework includes learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online with a rollout mechanism. Experimental results from our testbed show that, using our framework, we can effectively achieve performance objectives by automatically executing control actions to dynamically allocate resources to the services.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2024
Series
TRITA-EECS-AVL ; 2024:42
Keywords
Network management automation, performance management, reinforcement learning, multi-dimensional control, online policy adaptation, end-to-end quality of service estimation.
National Category
Computer Systems
Research subject
Electrical Engineering
Identifiers
URN: urn:nbn:se:kth:diva-346585
ISBN: 978-91-8040-917-9
Public defence
2024-06-10, Q2, Malvinas väg 10, STOCKHOLM, 10:00 (English)
Available from: 2024-05-22. Created: 2024-05-18. Last updated: 2024-06-10. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Authority records

Samani, Forough Shahab; Stadler, Rolf

