kth.sePublications KTH
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
A Framework for Dynamically Meeting Performance Objectives on a Service Mesh
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering.ORCID iD: 0000-0002-6343-7416
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Network and Systems Engineering.ORCID iD: 0000-0001-6039-8493
2024 (English)In: IEEE Transactions on Network and Service Management, E-ISSN 1932-4537, Vol. 21, no 6, p. 5992-6007Article in journal (Refereed) Published
Abstract [en]

We present a framework for achieving end-to-end management objectives for multiple services that concurrently execute on a service mesh. We apply reinforcement learning (RL) techniques to train an agent that periodically performs control actions to reallocate resources. We develop and evaluate the framework using a laboratory testbed where we run information and computing services on a service mesh, supported by the Istio and Kubernetes platforms. We investigate different management objectives that include end-to-end delay bounds on service requests, throughput objectives, cost-related objectives, and service differentiation. Our framework supports the design of a control agent for a given management objective. The management objective is defined first and then mapped onto available control actions. Several types of control actions can be executed simultaneously, which allows for efficient resource utilization. Second, the framework separates the learning of the system model and the operating region from the learning of the control policy. By first learning the system model and the operating region from testbed traces, we can instantiate a simulator and train the agent for different management objectives. Third, the use of a simulator shortens the training time by orders of magnitude compared with training the agent on the testbed. We evaluate the learned policies on the testbed and show the effectiveness of our approach in several scenarios. In one scenario, we design a controller that achieves the management objectives with 50% less system resources than Kubernetes HPA autoscaling.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE) , 2024. Vol. 21, no 6, p. 5992-6007
Keywords [en]
Microservice architectures, Measurement, Training, Reinforcement learning, Delays, Resource management, Throughput, Performance management, adaptive resource allocation, microservice, operating region
National Category
Computer Systems
Identifiers
URN: urn:nbn:se:kth:diva-359479DOI: 10.1109/TNSM.2024.3434328ISI: 001381366600033Scopus ID: 2-s2.0-85199569083OAI: oai:DiVA.org:kth-359479DiVA, id: diva2:1935122
Note

Not duplicate with DiVA 1858751

QC 20250206

Available from: 2025-02-06 Created: 2025-02-06 Last updated: 2025-02-06Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Authority records

Samani, Forough ShahabStadler, Rolf

Search in DiVA

By author/editor
Samani, Forough ShahabStadler, Rolf
By organisation
Network and Systems Engineering
In the same journal
IEEE Transactions on Network and Service Management
Computer Systems

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 30 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf