Conditional Density Estimation of Service Metrics for Networked Services
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. RISE Res Inst Sweden, Dept Digital Syst, S-50115 Borås, Sweden. ORCID iD: 0000-0002-6343-7416
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering. RISE Res Inst Sweden, Dept Digital Syst, S-50115 Borås, Sweden. ORCID iD: 0000-0001-6039-8493
Ericsson Res, RA Machine Intelligence, S-16480 Stockholm, Sweden.
Ericsson Res, Res Area Artificial Intelligence Dept, S-16483 Stockholm, Sweden.
2021 (English). In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 18, no 2, p. 2350-2364. Article in journal (Refereed), Published
Abstract [en]

We predict the conditional distributions of service metrics, such as response time or frame rate, from infrastructure measurements in a networked environment. From such distributions, key statistics of the service metrics, including mean, variance, or quantiles, can be computed; these statistics are essential for predicting SLA conformance and enabling service assurance. We present and assess two methods for prediction: (1) mixture models with Gaussian or Lognormal kernels, whose parameters are estimated using mixture density networks, a class of neural networks, and (2) histogram models, which require the target space to be discretized. We apply these methods to a VoD service and a KV store service running on our lab testbed. A comparative evaluation shows the relative effectiveness of the methods when applied to operational data. We find that both methods allow for accurate prediction. While mixture models provide a general and elegant solution, they incur a very high overhead related to hyper-parameter search and neural network training. Histogram models, on the other hand, allow for efficient training, but require adjustment to the specific use case.
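As an illustration of how key statistics can be derived once a conditional distribution has been predicted, the following minimal Python sketch computes the mean, variance, and a quantile of a Gaussian mixture, and discretizes target samples into histogram bin probabilities. It is not the authors' implementation: the function names, the parameter layout (weights, means, standard deviations for a single input), the bisection-based quantile search, and the bin edges are all assumptions made for this example. Lognormal kernels and the neural network that would produce the mixture parameters are not covered.

```python
import math


def mixture_mean(weights, means):
    """Mean of a Gaussian mixture: weighted sum of component means."""
    return sum(w * m for w, m in zip(weights, means))


def mixture_variance(weights, means, stds):
    """Variance via the law of total variance over the components."""
    mu = mixture_mean(weights, means)
    return sum(w * (s * s + (m - mu) ** 2)
               for w, m, s in zip(weights, means, stds))


def mixture_cdf(y, weights, means, stds):
    """CDF of the mixture: weighted sum of Gaussian component CDFs."""
    return sum(w * 0.5 * (1.0 + math.erf((y - m) / (s * math.sqrt(2.0))))
               for w, m, s in zip(weights, means, stds))


def mixture_quantile(q, weights, means, stds, tol=1e-9):
    """Quantile by bisection on the CDF (assumed search range: +/- 10 std)."""
    lo = min(m - 10.0 * s for m, s in zip(means, stds))
    hi = max(m + 10.0 * s for m, s in zip(means, stds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, means, stds) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def histogram_probs(samples, edges):
    """Discretize target samples into bins and return bin probabilities.

    Bins are half-open [edges[i], edges[i+1]); the last bin also includes
    its right edge. Samples outside the edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for y in samples:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= y < edges[i + 1] or (is_last and y == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples fall inside the bin edges")
    return [c / total for c in counts]


if __name__ == "__main__":
    # Hypothetical mixture parameters, e.g. for one infrastructure measurement.
    w, mu, sd = [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]
    print(mixture_mean(w, mu), mixture_variance(w, mu, sd))
    print(mixture_quantile(0.95, w, mu, sd))
    print(histogram_probs([0.1, 0.2, 0.9, 1.0], [0.0, 0.5, 1.0]))
```

In a conditional setting, a model would output one such set of mixture parameters, or one vector of bin probabilities, for each input measurement; the statistics above would then be evaluated per input.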

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021. Vol. 18, no 2, p. 2350-2364
Keywords [en]
Measurement, Kernel, Histograms, Data models, Mixture models, Estimation, Time factors, KPI prediction, conditional density estimation (CDE), machine learning, statistical learning, service engineering, network management
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:kth:diva-298658
DOI: 10.1109/TNSM.2021.3077357
ISI: 000660636700085
Scopus ID: 2-s2.0-85105891540
OAI: oai:DiVA.org:kth-298658
DiVA, id: diva2:1579716
Note

QC 20210710

Available from: 2021-07-10. Created: 2021-07-10. Last updated: 2024-07-04. Bibliographically approved
In thesis
1. End-to-end performance prediction and automated resource management of cloud services
2024 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Cloud-based services are integral to modern life. Cloud systems aim to provide customers with uninterrupted services of high quality while enabling cost-effective fulfillment by providers. The key to meeting quality requirements and end-to-end performance objectives is to devise effective strategies to allocate resources to the services. This in turn requires automation of resource allocation. Recently, researchers have studied learning-based approaches, especially reinforcement learning (RL), for automated resource allocation. These approaches are particularly promising for resource allocation in cloud systems because they can deal with the architectural complexity of a cloud environment. Previous research shows that reinforcement learning is effective for specific types of controls, such as horizontal or vertical scaling of compute resources. Major obstacles to operational deployment remain, however. Chief among them is the fact that reinforcement learning methods require long times for training and for retraining after system changes.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation using reinforcement learning on a testbed. On the conceptual level, we address two interconnected problems: predicting end-to-end service performance and automated resource allocation for cloud services. First, we study methods to predict the conditional density of service metrics and demonstrate the effectiveness of employing dimensionality reduction methods to reduce monitoring, communication, and model-training overhead. For automated resource allocation, we develop a framework for RL-based control. Our approach involves learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that using our framework, we can effectively achieve end-to-end performance objectives by dynamically allocating resources to the services using different types of control actions simultaneously.

Abstract [sv] (translated to English)

Cloud-based services are integral to modern life. Cloud systems can offer uninterrupted services of high quality while enabling cost-effective implementation and operation. The key to meeting quality requirements and performance objectives for cloud services is to develop effective strategies for allocating resources to the services. This in turn requires automation, which is particularly important in shared infrastructures that host different services. Current research in the field studies methods based on reinforcement learning (RL) for automated resource allocation. These methods are particularly well suited to cloud environments because they can handle the architectural complexity that is typical of such environments. Results from previous research show that RL is an effective method for specific types of control actions, such as horizontal or vertical scaling of compute resources. Important challenges for deploying RL in operational systems remain, however. Among these is the fact that RL requires long optimization times and that the optimization must be redone after every system change.

With this thesis, we aim to overcome these obstacles and demonstrate dynamic resource allocation with RL on a testbed. On the conceptual level, we address two interconnected problems: predicting performance metrics and automated resource allocation for cloud services. First, we study methods for predicting the conditional probability of performance metrics and demonstrate the effectiveness of using dimensionality reduction to reduce the cost of model training. Then we develop a framework for automated resource allocation based on RL. Our framework involves learning a system model from measurements, using a simulator to learn resource allocation policies, and adapting these policies online using a rollout mechanism. Experimental results from our testbed show that, by using our framework, we can effectively achieve performance objectives by automatically executing control actions to dynamically allocate resources to the services.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2024
Series
TRITA-EECS-AVL ; 2024:42
Keywords
Network management automation, performance management, reinforcement learning, multi-dimensional control, online policy adaptation, end-to-end quality of service estimation.
National Category
Computer Systems
Research subject
Electrical Engineering
Identifiers
urn:nbn:se:kth:diva-346585 (URN)
978-91-8040-917-9 (ISBN)
Public defence
2024-06-10, Q2, Malvinas väg 10, STOCKHOLM, 10:00 (English)
Available from: 2024-05-22. Created: 2024-05-18. Last updated: 2024-06-10. Bibliographically approved

Authority records

Samani, Forough Shahab; Stadler, Rolf
