A Framework for Dynamically Meeting Performance Objectives on a Service Mesh
2024 (English). In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 21, no 6, p. 5992-6007. Article in journal (Refereed), Published
Abstract [en]
We present a framework for achieving end-to-end management objectives for multiple services that execute concurrently on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework on a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. First, the management objective is defined and then mapped onto the available control actions. Several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates the learning of the system model and the operating region from the learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% fewer system resources than the Kubernetes Horizontal Pod Autoscaler (HPA).
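The separation described in the abstract (learn a system model from traces first, then train the agent on a simulator built from that model) can be illustrated with a minimal sketch. This is not the authors' implementation. The load levels, replica counts, delay formula, reward shape, and every function name below are hypothetical, and the agent is a simple one-step, epsilon-greedy value learner rather than the RL method used in the paper.

```python
import random
from collections import defaultdict

# All names and numbers below are hypothetical, chosen only for illustration.
LOADS = [1, 2, 3]        # discretized offered load levels
REPLICAS = [1, 2, 3, 4]  # control action: replica count for one service
DELAY_BOUND = 1.0        # end-to-end delay objective in seconds

def collect_traces(rng, samples_per_point=20):
    """Stand-in for testbed measurements: (load, replicas, delay) tuples."""
    traces = []
    for load in LOADS:
        for rep in REPLICAS:
            for _ in range(samples_per_point):
                # Made-up delay behaviour plus small measurement noise.
                delay = 0.6 * load / rep + rng.uniform(-0.02, 0.02)
                traces.append((load, rep, delay))
    return traces

def fit_system_model(traces):
    """Step 1: learn mean delay per (load, replicas) from traces.
    The keys of the returned dict play the role of the operating region."""
    sums, counts = defaultdict(float), defaultdict(int)
    for load, rep, delay in traces:
        sums[(load, rep)] += delay
        counts[(load, rep)] += 1
    return {key: sums[key] / counts[key] for key in sums}

def reward(delay, rep):
    """Encode the objective: meet the delay bound, then use few replicas."""
    return -10.0 if delay > DELAY_BOUND else -0.1 * rep

def train_agent(model, rng, steps=5000, alpha=0.1, epsilon=0.2):
    """Step 2: train on the simulator (the learned model), not the testbed."""
    q = defaultdict(float)
    for _ in range(steps):
        load = rng.choice(LOADS)  # load is exogenous in this sketch
        if rng.random() < epsilon:
            rep = rng.choice(REPLICAS)  # explore
        else:
            rep = max(REPLICAS, key=lambda a: q[(load, a)])  # exploit
        r = reward(model[(load, rep)], rep)  # simulated outcome
        q[(load, rep)] += alpha * (r - q[(load, rep)])
    return {load: max(REPLICAS, key=lambda a: q[(load, a)]) for load in LOADS}

model = fit_system_model(collect_traces(random.Random(0)))
policy = train_agent(model, random.Random(1))
print(policy)  # maps each load level to a replica count
```

In this sketch, a different management objective only requires a different `reward` function and a rerun of `train_agent` on the same learned model. No new traces are collected, which mirrors the reuse of the simulator that the abstract describes.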
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024. Vol. 21, no 6, p. 5992-6007
Keywords [en]
Microservice architectures, Measurement, Training, Reinforcement learning, Delays, Resource management, Throughput, Performance management, adaptive resource allocation, microservice, operating region
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-359479
DOI: 10.1109/TNSM.2024.3434328
ISI: 001381366600033
Scopus ID: 2-s2.0-85199569083
OAI: oai:DiVA.org:kth-359479
DiVA, id: diva2:1935122
Note
Not a duplicate of DiVA 1858751
QC 20250206
2025-02-06 2025-02-06 2025-02-06
Bibliographically approved