Samani, Forough Shahab (ORCID iD: orcid.org/0000-0002-6343-7416)
Publications (10 of 12)
Samani, F. S. & Stadler, R. (2024). A Framework for dynamically meeting performance objectives on a service mesh.
2024 (English). Manuscript (preprint) (Other academic)
Abstract [en]

We present a framework for achieving end-to-end management objectives for multiple services that concurrently execute on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework using a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. It is novel in three respects. First, it advocates a top-down approach whereby the management objective is defined first and then mapped onto the available control actions. Several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates learning of the system model and the operating region from learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives in parallel. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% fewer system resources than Kubernetes HPA autoscaling.

National Category
Engineering and Technology Computer Systems
Identifiers
urn:nbn:se:kth:diva-346583 (URN)
Note

QC 20240522

Available from: 2024-05-18. Created: 2024-05-18. Last updated: 2024-05-22. Bibliographically approved.
Samani, F. S., Larsson, H., Damberg, S., Johnsson, A. & Stadler, R. (2024). Comparing Transfer Learning and Rollout for Policy Adaptation in a Changing Network Environment. In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2024, NOMS 2024. Paper presented at 2024 IEEE/IFIP Network Operations and Management Symposium, NOMS 2024, Seoul, Korea, May 6 2024 - May 10 2024. Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2024, NOMS 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic resource allocation for network services is pivotal for achieving end-to-end management objectives. Previous research has demonstrated that Reinforcement Learning (RL) is a promising approach to resource allocation in networks, making it possible to obtain near-optimal control policies for non-trivial system configurations. Current RL approaches, however, have the drawback that a change in the system or the management objective necessitates expensive retraining of the RL agent. To tackle this challenge, practical solutions including offline retraining, transfer learning, and model-based rollout have been proposed. In this work, we study these methods and present comparative results that shed light on their respective performance and benefits. Our study finds that rollout achieves faster adaptation than transfer learning, yet its effectiveness depends strongly on the accuracy of the system model.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Istio, Kubernetes, Performance management, policy adaptation, reinforcement learning, rollout, service mesh
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-351010 (URN)
10.1109/NOMS59830.2024.10575398 (DOI)
001270140300103 ()
2-s2.0-85198375028 (Scopus ID)
Conference
2024 IEEE/IFIP Network Operations and Management Symposium, NOMS 2024, Seoul, Korea, May 6 2024 - May 10 2024
Note

Part of ISBN 9798350327939

QC 20240725

Available from: 2024-07-24. Created: 2024-07-24. Last updated: 2024-09-27. Bibliographically approved.
Shahabsamani, F., Hammar, K. & Stadler, R. (2024). Online Policy Adaptation for Networked Systems using Rollout. Paper presented at IEEE/IFIP Network Operations and Management Symposium 6–10 May 2024, Seoul, South Korea.
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic resource allocation in networked systems is needed to continuously achieve end-to-end management objectives. Recent research has shown that reinforcement learning can achieve near-optimal resource allocation policies for realistic system configurations. However, most current solutions require expensive retraining when changes in the system occur. We address this problem and introduce an efficient method to adapt a given base policy to system changes, e.g., to a change in the service offering. In our approach, we adapt a base control policy using a rollout mechanism, which transforms the base policy into an improved rollout policy. We perform extensive evaluations on a testbed where we run applications on a service mesh based on the Istio and Kubernetes platforms. The experiments provide insights into the performance of different rollout algorithms. We find that our approach produces policies that are as effective as those obtained by offline retraining. On our testbed, effective policy adaptation takes seconds when using rollout, compared to minutes or hours when using retraining. Our work demonstrates that rollout, which has been applied successfully in other domains, is an effective approach for policy adaptation in networked systems.

National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-346582 (URN)
Conference
IEEE/IFIP Network Operations and Management Symposium 6–10 May 2024, Seoul, South Korea
Note

QC 20240522

Available from: 2024-05-18. Created: 2024-05-18. Last updated: 2024-06-10. Bibliographically approved.
Samani, F. S., Hammar, K. & Stadler, R. (2024). Online Policy Adaptation for Networked Systems using Rollout. In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2024, NOMS 2024. Paper presented at 2024 IEEE/IFIP Network Operations and Management Symposium, NOMS 2024, Seoul, Korea, May 6 2024 - May 10 2024. Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2024, NOMS 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Dynamic resource allocation in networked systems is needed to continuously achieve end-to-end management objectives. Recent research has shown that reinforcement learning can achieve near-optimal resource allocation policies for realistic system configurations. However, most current solutions require expensive retraining when changes in the system occur. We address this problem and introduce an efficient method to adapt a given base policy to system changes, e.g., to a change in the service offering. In our approach, we adapt a base control policy using a rollout mechanism, which transforms the base policy into an improved rollout policy. We perform extensive evaluations on a testbed where we run applications on a service mesh based on the Istio and Kubernetes platforms. The experiments provide insights into the performance of different rollout algorithms. We find that our approach produces policies that are as effective as those obtained by offline retraining. On our testbed, effective policy adaptation takes seconds when using rollout, compared to minutes or hours when using retraining. Our work demonstrates that rollout, which has been applied successfully in other domains, is an effective approach for policy adaptation in networked systems.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Istio, Kubernetes, Performance management, policy adaptation, reinforcement learning, rollout, service mesh
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-351011 (URN)
10.1109/NOMS59830.2024.10575707 (DOI)
001270140300173 ()
2-s2.0-85198340187 (Scopus ID)
Conference
2024 IEEE/IFIP Network Operations and Management Symposium, NOMS 2024, Seoul, Korea, May 6 2024 - May 10 2024
Note

Part of ISBN 9798350327939

QC 20240725

Available from: 2024-07-24. Created: 2024-07-24. Last updated: 2024-09-27. Bibliographically approved.
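
The rollout mechanism described in the abstract above can be illustrated with a generic one-step lookahead rollout: each candidate action is scored by simulating it once and then following the base policy for a fixed horizon. This is only a sketch under stated assumptions, not the algorithm evaluated in the paper. The deterministic toy model `step`, the do-nothing `base_policy`, the constant `TARGET`, and the function `rollout_action` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Sequence, Tuple

TARGET = 5  # hypothetical operating point the controller should reach


def step(state: int, action: int) -> Tuple[int, float]:
    """Toy deterministic system model (assumption): the action shifts the state,
    and the reward is the negative distance to TARGET."""
    next_state = state + action
    return next_state, -abs(next_state - TARGET)


def base_policy(state: int) -> int:
    """Hypothetical base policy that never changes the allocation."""
    return 0


def rollout_action(
    state: int,
    actions: Sequence[int],
    model: Callable[[int, int], Tuple[int, float]],
    policy: Callable[[int], int],
    horizon: int,
    gamma: float = 1.0,
) -> int:
    """Return the action with the best simulated return when the first step
    uses that action and the next `horizon` steps follow the base policy."""

    def simulated_return(first_action: int) -> float:
        s, total = model(state, first_action)
        discount = gamma
        for _ in range(horizon):
            s, reward = model(s, policy(s))
            total += discount * reward
            discount *= gamma
        return total

    return max(actions, key=simulated_return)
```

In this toy setting, starting from state 0 the rollout policy selects action +1, whereas the base policy alone would select 0. The sketch also makes the dependence on the model visible: every score comes from `model`, so an inaccurate model leads directly to poor action choices.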
Samani, F. S., Hammar, K. & Stadler, R. (2023). Demonstrating a System for Dynamically Meeting Management Objectives on a Service Mesh. In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2023, NOMS 2023. Paper presented at 36th IEEE/IFIP Network Operations and Management Symposium, NOMS 2023, Miami, United States of America, May 8 2023 - May 12 2023. Institute of Electrical and Electronics Engineers (IEEE)
2023 (English). In: Proceedings of IEEE/IFIP Network Operations and Management Symposium 2023, NOMS 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

We demonstrate a management system that lets a service provider achieve end-to-end management objectives under varying load for applications on a service mesh based on the Istio and Kubernetes platforms. The management objectives for the demonstration include end-to-end delay bounds on service requests, throughput objectives, and service differentiation. Our method for finding effective control policies includes a simulator and a control module. The simulator is instantiated with traces from a testbed, and the control module trains a reinforcement learning (RL) agent to efficiently learn effective control policies on the simulator. The learned policies are then transferred to the testbed to perform dynamic control actions based on monitored system metrics. We show that the learned policies dynamically meet management objectives on the testbed and can be changed on the fly.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
digital twin, Istio, Kubernetes, Performance management, reinforcement learning (RL), service mesh
National Category
Computer Sciences Control Engineering
Identifiers
urn:nbn:se:kth:diva-334446 (URN)
10.1109/NOMS56928.2023.10154365 (DOI)
2-s2.0-85164731961 (Scopus ID)
Conference
36th IEEE/IFIP Network Operations and Management Symposium, NOMS 2023, Miami, United States of America, May 8 2023 - May 12 2023
Note

Part of ISBN 9781665477161

QC 20230821

Available from: 2023-08-21. Created: 2023-08-21. Last updated: 2024-06-10. Bibliographically approved.
Samani, F. S. & Stadler, R. (2022). Dynamically meeting performance objectives for multiple services on a service mesh. In: Charalambides, M., Papadimitriou, P., Cerroni, W., Kanhere, S. & Mamatas, L. (Ed.), 2022 18th International Conference on Network and Service Management (CNSM 2022): Intelligent Management of Disruptive Network Technologies and Services. Paper presented at 18th International Conference on Network and Service Management (CNSM) - Intelligent Management of Disruptive Network Technologies and Services, Oct 31-Nov 4, 2022, Thessaloniki, Greece (pp. 219-225). IEEE
2022 (English). In: 2022 18th International Conference on Network and Service Management (CNSM 2022): Intelligent Management of Disruptive Network Technologies and Services / [ed] Charalambides, M., Papadimitriou, P., Cerroni, W., Kanhere, S. & Mamatas, L., IEEE, 2022, p. 219-225. Conference paper, Published paper (Refereed)
Abstract [en]

We present a framework that lets a service provider achieve end-to-end management objectives under varying load. Dynamic control actions are performed by a reinforcement learning (RL) agent. Our work includes experimentation and evaluation on a laboratory testbed where we have implemented basic information services on a service mesh supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, and service differentiation. These objectives are mapped onto reward functions that an RL agent learns to optimize, by executing control actions, namely, request routing and request blocking. We compute the control policies not on the testbed, but in a simulator, which speeds up the learning process by orders of magnitude. In our approach, the system model is learned on the testbed; it is then used to instantiate the simulator, which produces near-optimal control policies for various management objectives. The learned policies are then evaluated on the testbed using unseen load patterns.

Place, publisher, year, edition, pages
IEEE, 2022
Series
International Conference on Network and Service Management, ISSN 2165-9605
Keywords
Performance management, reinforcement learning, service mesh, digital twin
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-323573 (URN)
10.23919/CNSM55787.2022.9965074 (DOI)
000903721000027 ()
2-s2.0-85143912559 (Scopus ID)
Conference
18th International Conference on Network and Service Management (CNSM) - Intelligent Management of Disruptive Network Technologies and Services, Oct 31-Nov 4, 2022, Thessaloniki, Greece
Note

Part of proceedings: ISBN 978-3-903176-51-5

QC 20230207

Available from: 2023-02-07. Created: 2023-02-07. Last updated: 2024-06-10. Bibliographically approved.
Samani, F. S., Stadler, R., Flinta, C. & Johnsson, A. (2021). Conditional Density Estimation of Service Metrics for Networked Services. IEEE Transactions on Network and Service Management, 18(2), 2350-2364
2021 (English). In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 18, no 2, p. 2350-2364. Article in journal (Refereed), Published
Abstract [en]

We predict the conditional distributions of service metrics, such as response time or frame rate, from infrastructure measurements in a networked environment. From such distributions, key statistics of the service metrics, including mean, variance, or quantiles can be computed, which are essential for predicting SLA conformance and enabling service assurance. We present and assess two methods for prediction: (1) mixture models with Gaussian or Lognormal kernels, whose parameters are estimated using mixture density networks, a class of neural networks, and (2) histogram models, which require the target space to be discretized. We apply these methods to a VoD service and a KV store service running on our lab testbed. A comparative evaluation shows the relative effectiveness of the methods when applied to operational data. We find that both methods allow for accurate prediction. While mixture models provide a general and elegant solution, they incur a very high overhead related to hyper-parameter search and neural network training. Histogram models, on the other hand, allow for efficient training, but require adjustment to the specific use case.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021
Keywords
Measurement, Kernel, Histograms, Data models, Mixture models, Estimation, Time factors, KPI prediction, conditional density estimation (CDE), machine learning, statistical learning, service engineering, network management
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-298658 (URN)
10.1109/TNSM.2021.3077357 (DOI)
000660636700085 ()
2-s2.0-85105891540 (Scopus ID)
Note

QC 20210710

Available from: 2021-07-10. Created: 2021-07-10. Last updated: 2024-07-04. Bibliographically approved.
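
The abstract above notes that key statistics such as mean, variance, and quantiles can be computed from a predicted conditional distribution. As a minimal sketch (not the authors' implementation), assume a mixture density network has already produced the weights, means, and standard deviations of a Gaussian mixture for one input; the statistics could then be derived as follows. All function names are hypothetical, and the quantile is found by simple bisection on the mixture CDF rather than by any method named in the paper.

```python
import math
from typing import Sequence


def mixture_mean(weights: Sequence[float], means: Sequence[float]) -> float:
    """Mean of a Gaussian mixture: the weighted sum of component means."""
    return sum(w * m for w, m in zip(weights, means))


def mixture_variance(
    weights: Sequence[float], means: Sequence[float], stds: Sequence[float]
) -> float:
    """Variance via the second moment: E[Y^2] - (E[Y])^2."""
    mu = mixture_mean(weights, means)
    second_moment = sum(
        w * (s * s + m * m) for w, m, s in zip(weights, means, stds)
    )
    return second_moment - mu * mu


def mixture_cdf(
    x: float, weights: Sequence[float], means: Sequence[float], stds: Sequence[float]
) -> float:
    """CDF of the mixture: the weighted sum of component normal CDFs."""
    return sum(
        w * 0.5 * (1.0 + math.erf((x - m) / (s * math.sqrt(2.0))))
        for w, m, s in zip(weights, means, stds)
    )


def mixture_quantile(
    q: float,
    weights: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
    tol: float = 1e-9,
) -> float:
    """q-quantile by bisection on the (monotone) mixture CDF."""
    lo = min(m - 10.0 * s for m, s in zip(means, stds))
    hi = max(m + 10.0 * s for m, s in zip(means, stds))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, weights, means, stds) < q:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A high quantile of a predicted response-time distribution, for example `mixture_quantile(0.95, ...)`, is the kind of statistic one might compare against an SLA threshold.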
Wang, X., Samani, F. S., Johnsson, A. & Stadler, R. (2021). Online Feature Selection for Low-overhead Learning in Networked Systems. In: Chemouil, P., Ulema, M., Clayman, S., Sayit, M., Cetinkaya, C. & Secci, S. (Ed.), Proceedings of the 2021 17th International Conference on Network and Service Management: Smart Management for Future Networks and Services, CNSM 2021. Paper presented at 17th International Conference on Network and Service Management, CNSM 2021, Online/Virtual, 25-29 October 2021 (pp. 527-529). Institute of Electrical and Electronics Engineers Inc.
2021 (English). In: Proceedings of the 2021 17th International Conference on Network and Service Management: Smart Management for Future Networks and Services, CNSM 2021 / [ed] Chemouil, P., Ulema, M., Clayman, S., Sayit, M., Cetinkaya, C. & Secci, S., Institute of Electrical and Electronics Engineers Inc., 2021, p. 527-529. Conference paper, Published paper (Refereed)
Abstract [en]

Data-driven functions for operation and management require measurements and readings from distributed data sources for model training and prediction. While the number of candidate data sources can be very large, research has shown that it is often possible to reduce the number of data sources significantly while still allowing for accurate prediction. Consequently, there is potential to lower communication and computing resources needed to continuously extract, collect, and process this data. We demonstrate the operation of a novel online algorithm called OSFS, which sequentially processes the collected data and reduces the number of data sources for training prediction models. OSFS builds on two main ideas, namely (1) ranking the available data sources using (unsupervised) feature selection algorithms and (2) identifying stable feature sets that include only the top features. The demonstration shows the search space exploration, the iterative selection of feature sets, and the evaluation of the stability of these sets. The demonstration uses measurements collected from a KTH testbed, and the predictions relate to end-to-end KPIs for network services. 

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2021
Series
International Conference on Network and Service Management, ISSN 2165-9605
Keywords
Data-driven Engineering, Feature Selection, Machine Learning, Network Management, Forecasting, Information management, Iterative methods, Online systems, Space research, Data driven, Data-source, Features selection, Features sets, Low overhead, Machine-learning, Networks management, Number of datum, Online feature selection, Feature extraction
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-316390 (URN)
10.23919/CNSM52442.2021.9615548 (DOI)
000836226700084 ()
2-s2.0-85123422408 (Scopus ID)
Conference
17th International Conference on Network and Service Management, CNSM 2021, Online/Virtual, 25-29 October 2021
Note

Part of proceedings: ISBN 978-3-903176-36-2

QC 20220816

Available from: 2022-08-16. Created: 2022-08-16. Last updated: 2024-06-10. Bibliographically approved.
Wang, X., Samani, F. S. & Stadler, R. (2020). Online feature selection for rapid, low-overhead learning in networked systems. In: Zincir-Heywood, N., Ulema, M., Sayit, M., Clayman, S., Kim, M. S. & Cetinkaya, C. (Ed.), 2020 16th international conference on network and service management (CNSM). Paper presented at 16th International Conference on Network and Service Management (CNSM) / 2nd International Workshop on Analytics for Service and Application Management (AnServApp) / 1st International Workshop on the Future Evolution of Internet Protocols (IPFuture), NOV 02-06, 2020, ELECTR NETWORK. IEEE
2020 (English). In: 2020 16th international conference on network and service management (CNSM) / [ed] Zincir-Heywood, N., Ulema, M., Sayit, M., Clayman, S., Kim, M. S. & Cetinkaya, C., IEEE, 2020. Conference paper, Published paper (Refereed)
Abstract [en]

Data-driven functions for operation and management often require measurements collected through monitoring for model training and prediction. The number of data sources can be very large, which requires significant communication and computing overhead to continuously extract and collect this data, as well as to train and update the machine-learning models. We present an online algorithm, called OSFS, that selects a small feature set from a large number of available data sources, which allows for rapid, low-overhead, and effective learning and prediction. OSFS is instantiated with a feature ranking algorithm and applies the concept of a stable feature set, which we introduce in the paper. We perform an extensive experimental evaluation of our method on data from an in-house testbed. We find that OSFS requires several hundred measurements to reduce the number of data sources by two orders of magnitude, from which models are trained with acceptable prediction accuracy. While our method is heuristic and can be improved in many ways, the results clearly suggest that many learning tasks do not require a lengthy monitoring phase and expensive offline training.

Place, publisher, year, edition, pages
IEEE, 2020
Series
International Conference on Network and Service Management, ISSN 2165-9605
Keywords
Data-driven engineering, Machine learning (ML), Dimensionality reduction
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-291038 (URN)
10.23919/CNSM50824.2020.9269066 (DOI)
000612229200029 ()
2-s2.0-85098668191 (Scopus ID)
Conference
16th International Conference on Network and Service Management (CNSM) / 2nd International Workshop on Analytics for Service and Application Management (AnServApp) / 1st International Workshop on the Future Evolution of Internet Protocols (IPFuture), NOV 02-06, 2020, ELECTR NETWORK
Note

QC 20210303

Available from: 2021-03-03. Created: 2021-03-03. Last updated: 2024-06-10. Bibliographically approved.
Samani, F. S., Stadler, R., Johnsson, A. & Flinta, C. (2019). Demonstration: Predicting Distributions of Service Metrics. In: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM). Paper presented at 2019 IFIP/IEEE Symposium on Integrated Network and Service Management, IM 2019; Arlington; United States; 8 April 2019 through 12 April 2019 (pp. 745-746). Institute of Electrical and Electronics Engineers (IEEE), Article ID 8717915.
2019 (English). In: 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), Institute of Electrical and Electronics Engineers (IEEE), 2019, p. 745-746, article id 8717915. Conference paper, Published paper (Refereed)
Abstract [en]

The ability to predict conditional distributions of service metrics is key to understanding end-to-end service behavior. From conditional distributions, other metrics can be derived, such as expected values and quantiles, which are essential for assessing SLA conformance. Our demonstrator predicts conditional distributions and estimates derived metrics in real time, using infrastructure measurements. The distributions are modeled as Gaussian mixtures whose parameters are estimated using a mixture density network. The predictions are produced for a Video-on-Demand service that runs on a testbed at KTH.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2019
Keywords
Service Engineering, Service Management, Machine Learning
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-254135 (URN)
000469937200144 ()
2-s2.0-85067047473 (Scopus ID)
978-3-903176-15-7 (ISBN)
Conference
2019 IFIP/IEEE Symposium on Integrated Network and Service Management, IM 2019; Arlington; United States; 8 April 2019 through 12 April 2019
Note

QC 20190625

Available from: 2019-06-25. Created: 2019-06-25. Last updated: 2024-06-10. Bibliographically approved.