End-to-end performance prediction and automated resource management of cloud services
Shahabsamani, Forough
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering.
ORCID iD: 0000-0002-6343-7416
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud-based services are integral to modern life. Cloud systems aim to provide customers with uninterrupted services of high quality while enabling cost-effective fulfillment by providers. The key to meeting quality requirements and end-to-end performance objectives is to devise effective strategies for allocating resources to the services. This in turn requires automation of resource allocation. Recently, researchers have studied learning-based approaches, especially reinforcement learning (RL), for automated resource allocation. These approaches are particularly promising for resource allocation in cloud systems because they can deal with the architectural complexity of a cloud environment. Previous research shows that reinforcement learning is effective for specific types of controls, such as horizontal or vertical scaling of compute resources. Major obstacles to operational deployment remain, however. Chief among them is the fact that reinforcement learning methods require long times for training and for retraining after system changes.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation using reinforcement learning on a testbed. On the conceptual level, we address two interconnected problems: predicting end-to-end service performance and automating resource allocation for cloud services. First, we study methods to predict the conditional density of service metrics and demonstrate the effectiveness of employing dimensionality-reduction methods to reduce monitoring, communication, and model-training overhead. For automated resource allocation, we develop a framework for RL-based control. Our approach involves learning a system model from measurements, using a simulator to learn resource-allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that, using our framework, we can effectively achieve end-to-end performance objectives by dynamically allocating resources to the services with different types of control actions executed simultaneously.

Abstract [sv]

Cloud-based services are integral to modern life. Cloud systems can offer uninterrupted services of high quality while enabling cost-effective deployment and operation. The key to meeting quality requirements and performance objectives for cloud services is to develop effective strategies for allocating resources to the services. This in turn requires automation, which is particularly important in shared infrastructures that host different services. Current research in the area studies methods based on reinforcement learning (RL) for automated resource allocation. These methods are particularly well suited to cloud environments because they can handle the architectural complexity that is typical of such environments. Results from previous research show that RL is an effective method for specific types of control actions, such as horizontal or vertical scaling of compute resources. Important challenges for deploying RL in operational systems remain, however. Among these is the fact that RL requires long optimization times and that the optimization must be redone after every system change.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation with RL on a testbed. On a conceptual level, we address two interconnected problems: predicting performance metrics and automating resource allocation for cloud services. First, we study methods for predicting the conditional probability of performance metrics and demonstrate the effectiveness of using dimensionality reduction to lower the cost of model training. We then develop a framework for automated resource allocation based on RL. Our framework includes learning a system model from measurements, using a simulator to learn resource-allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that, using our framework, we can effectively achieve performance objectives by automatically executing control actions that dynamically allocate resources to the services.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2024.
Series
TRITA-EECS-AVL ; 2024:42
Keywords [en]
Network management automation, performance management, reinforcement learning, multi-dimensional control, online policy adaptation, end-to-end quality of service estimation.
National Category
Computer Systems
Research subject
Electrical Engineering
Identifiers
URN: urn:nbn:se:kth:diva-346585
ISBN: 978-91-8040-917-9 (print)
OAI: oai:DiVA.org:kth-346585
DiVA, id: diva2:1858755
Public defence
2024-06-10, Q2, Malvinas väg 10, STOCKHOLM, 10:00 (English)
Available from: 2024-05-22 Created: 2024-05-18 Last updated: 2024-06-10 Bibliographically approved
List of papers
1. Conditional Density Estimation of Service Metrics for Networked Services
2021 (English) In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 18, no. 2, p. 2350-2364. Article in journal (Refereed) Published
Abstract [en]

We predict the conditional distributions of service metrics, such as response time or frame rate, from infrastructure measurements in a networked environment. From such distributions, key statistics of the service metrics, including the mean, variance, and quantiles, can be computed, which are essential for predicting SLA conformance and enabling service assurance. We present and assess two methods for prediction: (1) mixture models with Gaussian or Lognormal kernels, whose parameters are estimated using mixture density networks, a class of neural networks, and (2) histogram models, which require the target space to be discretized. We apply these methods to a video-on-demand (VoD) service and a key-value (KV) store service running on our lab testbed. A comparative evaluation shows the relative effectiveness of the methods when applied to operational data. We find that both methods allow for accurate prediction. While mixture models provide a general and elegant solution, they incur a very high overhead related to hyper-parameter search and neural network training. Histogram models, on the other hand, allow for efficient training but require adjustment to the specific use case.
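To make the mixture-model approach concrete, the following is a minimal sketch of a mixture density network for conditional density estimation, written in PyTorch. It is an illustrative assumption rather than the paper's implementation, and all names and layer sizes are placeholders. The network maps an infrastructure-feature vector x to the weights, means, and scales of K Gaussian kernels defining p(y | x) for a service metric y such as response time, and is trained by minimizing the negative log-likelihood.

```python
import torch
import torch.nn as nn

class MDN(nn.Module):
    """Predict p(y | x) as a mixture of K Gaussian kernels (illustrative sketch)."""
    def __init__(self, n_features, n_kernels=5, n_hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(n_features, n_hidden), nn.ReLU())
        self.pi = nn.Linear(n_hidden, n_kernels)         # mixture weights (logits)
        self.mu = nn.Linear(n_hidden, n_kernels)         # kernel means
        self.log_sigma = nn.Linear(n_hidden, n_kernels)  # kernel scales (log)

    def forward(self, x):
        h = self.body(x)
        return torch.log_softmax(self.pi(h), dim=-1), self.mu(h), self.log_sigma(h)

def nll(log_pi, mu, log_sigma, y):
    """Negative log-likelihood of y under the predicted Gaussian mixture."""
    kernels = torch.distributions.Normal(mu, log_sigma.exp())
    log_prob = kernels.log_prob(y.unsqueeze(-1)) + log_pi  # per-kernel log density
    return -torch.logsumexp(log_prob, dim=-1).mean()

# Usage sketch: x has shape (batch, n_features), y has shape (batch,).
# model = MDN(n_features=16); loss = nll(*model(x), y); loss.backward()
```

From the fitted mixture, statistics such as the mean or quantiles of the service metric can be computed analytically or by sampling.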

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
Measurement, Kernel, Histograms, Data models, Mixture models, Estimation, Time factors, KPI prediction, conditional density estimation (CDE), machine learning, statistical learning, service engineering, network management
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-298658 (URN)
10.1109/TNSM.2021.3077357 (DOI)
000660636700085 ()
2-s2.0-85105891540 (Scopus ID)
Note

QC 20210710

Available from: 2021-07-10 Created: 2021-07-10 Last updated: 2024-07-04 Bibliographically approved
2. Efficient Learning on High-dimensional Operational Data
2019 (English) In: 2019 15th International Conference on Network and Service Management (CNSM) / [ed] Lutfiyya, H.; Diao, Y.; Zincir-Heywood, N.; Badonnel, R.; Madeira, E., IEEE, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

In networked systems engineering, operational data gathered from sensors or logs can be used to build data-driven functions for performance prediction, anomaly detection, and other operational tasks. The number of data sources used for this purpose determines the dimensionality of the feature space for learning and can reach millions for medium-sized systems. Learning on a space with high dimensionality generally incurs high communication and computational costs for the learning process. In this work, we apply and compare a range of methods, including feature selection, Principal Component Analysis (PCA), and autoencoders, with the objective of reducing the dimensionality of the feature space while maintaining the prediction accuracy achieved by learning on the full space. We conduct the study using traces gathered from a testbed at KTH that runs a video-on-demand service and a key-value store under dynamic load. Our results suggest the feasibility of reducing the dimensionality of the feature space of operational data significantly, by one to two orders of magnitude in our scenarios, while maintaining prediction accuracy. The findings confirm the Manifold Hypothesis in machine learning, which states that real-world data sets tend to occupy a small subspace of the full feature space. In addition, we investigate the tradeoff between prediction accuracy and prediction overhead, which is crucial for applying the results to operational systems.
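As an illustration of the kind of pipeline the paper compares, the sketch below reduces a high-dimensional operational feature matrix with PCA from scikit-learn and contrasts prediction error against learning on the full feature space. The data is synthetic and stands in for testbed traces; the regressor, dimensions, and target are illustrative assumptions rather than the study's exact setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic placeholder for operational traces: 1000 infrastructure features
# per sample, one service metric (e.g. response time) as the prediction target.
rng = np.random.default_rng(0)
X = rng.random((5000, 1000))
y = X[:, :10].sum(axis=1) + 0.1 * rng.standard_normal(5000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def prediction_error(train_features, test_features):
    """Fit a regressor and report mean absolute error on the held-out set."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(train_features, y_tr)
    return mean_absolute_error(y_te, model.predict(test_features))

# Full feature space vs. a space reduced by roughly two orders of magnitude.
print("full space MAE:", prediction_error(X_tr, X_te))
pca = PCA(n_components=16).fit(X_tr)
print("PCA(16) MAE:   ", prediction_error(pca.transform(X_tr), pca.transform(X_te)))
```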

Place, publisher, year, edition, pages
IEEE, 2019
Series
International Conference on Network and Service Management, ISSN 2165-9605
Keywords
Data-driven engineering, Machine learning, ML, Dimensionality reduction
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-285610 (URN)
10.23919/CNSM46954.2019.9012741 (DOI)
000552229800074 ()
2-s2.0-85081966035 (Scopus ID)
Conference
15th Int Conf on Network and Serv Management (CNSM) / 1st Int Workshop on Analyt for Serv and Application Management (AnServApp) / Int Workshop on High-Precision Networks Operat and Control, Segment Routing and Serv Function Chaining (HiPNet+SR/SFC), OCT 21-25, 2019, Halifax, CANADA
Note

QC 20201111

Part of ISBN 978-3-903176-24-9

Available from: 2020-11-11 Created: 2020-11-11 Last updated: 2024-10-15 Bibliographically approved
3. A Framework for Dynamically Meeting Performance Objectives on a Service Mesh
2024 (English) Manuscript (preprint) (Other academic)
Abstract [en]

We present a framework for achieving end-to-end management objectives for multiple services that concurrently execute on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework using a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. It is novel in that, first, it advocates a top-down approach whereby the management objective is defined first and then mapped onto the available control actions; several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates learning of the system model and the operating region from learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives in parallel. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% fewer system resources than Kubernetes HPA autoscaling.
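The following is a hedged sketch of the step in which a system model is learned from testbed traces and wrapped as a simulator: a supervised regressor is fit to map (state, action) pairs to the next observed state, so that control policies can be trained against the regressor instead of the testbed. The state and action encodings and the synthetic data are assumptions for illustration, not the framework's actual interfaces.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for testbed traces: each row holds a system state (e.g. per-service
# load, CPU allocation, measured delay), the control action applied, and the next state.
rng = np.random.default_rng(0)
states = rng.random((10_000, 6))
actions = rng.integers(0, 3, size=(10_000, 2)).astype(float)           # e.g. scaling and routing choices
next_states = 0.8 * states + 0.05 * actions.sum(axis=1, keepdims=True)  # toy dynamics

# Learn the system model: (state, action) -> next state.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(np.hstack([states, actions]), next_states)

def simulator_step(state, action):
    """Predict the next system state; used in place of the testbed during policy training."""
    x = np.hstack([state, action]).reshape(1, -1)
    return model.predict(x)[0]
```

Training control agents against a function like `simulator_step`, rather than the live testbed, is what allows policies for different management objectives to be learned in parallel and far faster than online training.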

National Category
Engineering and Technology; Computer Systems
Identifiers
urn:nbn:se:kth:diva-346583 (URN)
Note

QC 20240522

Available from: 2024-05-18 Created: 2024-05-18 Last updated: 2024-05-22 Bibliographically approved
4. Online Policy Adaptation for Networked Systems using Rollout
2024 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic resource allocation in networked systems is needed to continuously achieve end-to-end management objectives. Recent research has shown that reinforcement learning can achieve near-optimal resource allocation policies for realistic system configurations. However, most current solutions require expensive retraining when changes in the system occur. We address this problem and introduce an efficient method to adapt a given base policy to system changes, e.g., to a change in the service offering. In our approach, we adapt a base control policy using a rollout mechanism, which transforms the base policy into an improved rollout policy. We perform extensive evaluations on a testbed where we run applications on a service mesh based on the Istio and Kubernetes platforms. The experiments provide insights into the performance of different rollout algorithms. We find that our approach produces policies that are equally effective as those obtained by offline retraining. On our testbed, effective policy adaptation takes seconds when using rollout, compared to minutes or hours when using retraining. Our work demonstrates that rollout, which has been applied successfully in other domains, is an effective approach for policy adaptation in networked systems.
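As an illustration of the rollout mechanism, here is a hedged one-step lookahead sketch over a discrete action set: each candidate action is scored by its immediate cost plus the cost of following the base policy in a learned simulator for a fixed horizon, and the lowest-cost action is chosen. The callables `simulate`, `cost`, and `base_policy` are assumptions standing in for the learned system model, the management objective, and the pre-trained policy; they are not taken from the paper's code.

```python
def rollout_action(state, actions, simulate, cost, base_policy, horizon=10):
    """Return the action with the lowest simulated cost-to-go under the base policy."""
    best_action, best_value = None, float("inf")
    for action in actions:
        value = cost(state, action)          # immediate cost of the candidate action
        s = simulate(state, action)
        for _ in range(horizon):             # continuation: follow the base policy
            a = base_policy(s)
            value += cost(s, a)
            s = simulate(s, a)
        if value < best_value:
            best_action, best_value = action, value
    return best_action
```

Because only this lookahead is evaluated online, adapting to a system change amounts to recomputing actions with the updated simulator rather than retraining the base policy.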

National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-346582 (URN)
Conference
IEEE/IFIP Network Operations and Management Symposium, 6–10 May 2024, Seoul, South Korea
Note

QC 20240522

Available from: 2024-05-18 Created: 2024-05-18 Last updated: 2024-06-10 Bibliographically approved

Open Access in DiVA

Kappa (1190 kB)
File information
File name: FULLTEXT04.pdf
File size: 1190 kB
Checksum (SHA-512): d25f074e9b080adf41df4f404e846486bdd55e11eb023316fd73a5ee051e1e208d0b3fa9e6544a5cd0438cb4ada7124a5967c9d1e9f4335209ae7ed0a39cb967
Type: fulltext
Mimetype: application/pdf
