Regret Analysis in Deterministic Reinforcement Learning
Tranos, Damianos (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control)). ORCID iD: 0000-0003-4606-0060
Proutiere, Alexandre (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control)). ORCID iD: 0000-0002-4679-4673
2021 (English). In: 2021 60th IEEE Conference on Decision and Control (CDC), Institute of Electrical and Electronics Engineers (IEEE), 2021, p. 2246-2251. Conference paper, Published paper (Refereed)
Abstract [en]

We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameters (in contrast to previous minimax approaches) and thus truly quantify the fundamental limit of the performance achievable by any learning algorithm. Deterministic MDPs can be interpreted as graphs and analyzed in terms of their cycles, a fact which we leverage to identify a class of deterministic MDPs whose regret lower bound can be determined numerically. We further exemplify this result on a deterministic line search problem and a deterministic MDP with state-dependent rewards, whose regret lower bounds we can state explicitly. These bounds share similarities with the known problem-specific bound of the multi-armed bandit problem and suggest that navigation on a deterministic MDP need not have an effect on the performance of a learning algorithm.
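
To make the graph interpretation above concrete, here is a minimal Python sketch, illustrative only: the states, actions, and rewards are invented, and this is not the numerical procedure of the paper. It treats a deterministic MDP as a weighted directed graph and recovers the optimal long-run average reward (gain) as the maximum mean reward over the simple cycles of that graph.

    # Toy deterministic MDP (hypothetical data):
    # transitions[state][action] = (next_state, reward)
    transitions = {
        0: {"a": (1, 0.0), "b": (0, 0.3)},
        1: {"a": (2, 0.1), "b": (0, 0.5)},
        2: {"a": (0, 1.0), "b": (1, 0.0)},
    }

    def all_cycles(transitions):
        """Enumerate the simple cycles of the MDP graph by depth-first search.

        Each cycle is a list of (state, action, reward) triples; rooting every
        cycle at its smallest state ensures each one is found exactly once.
        """
        cycles = []

        def dfs(state, start, path, on_path):
            for action, (nxt, reward) in transitions[state].items():
                step = path + [(state, action, reward)]
                if nxt == start:
                    cycles.append(step)  # closed a cycle back at the root
                elif nxt > start and nxt not in on_path:
                    dfs(nxt, start, step, on_path | {nxt})

        for s in transitions:
            dfs(s, s, [], {s})
        return cycles

    def optimal_gain(transitions):
        """Optimal long-run average reward = maximum mean reward over simple cycles."""
        return max(sum(r for _, _, r in c) / len(c) for c in all_cycles(transitions))

    print(optimal_gain(transitions))  # ~0.3667, from the cycle 0 -a-> 1 -a-> 2 -a-> 0

Brute-force cycle enumeration is exponential in general and is used here only because the example is tiny; the maximum cycle mean of a graph can be computed in polynomial time, e.g., with Karp's algorithm.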

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2021. p. 2246-2251
Series
IEEE Conference on Decision and Control, ISSN 0743-1546
National Category
Computer Sciences; Other Physics Topics; Communication Systems
Identifiers
URN: urn:nbn:se:kth:diva-313022
DOI: 10.1109/CDC45484.2021.9682972
ISI: 000781990302009
Scopus ID: 2-s2.0-85126035898
OAI: oai:DiVA.org:kth-313022
DiVA, id: diva2:1662640
Conference
60th IEEE Conference on Decision and Control (CDC), December 13-17, 2021, held online
Note

Part of proceedings: ISBN 978-1-6654-3659-5

QC 20220601

Available from: 2022-06-01. Created: 2022-06-01. Last updated: 2023-10-03. Bibliographically approved.
In thesis
1. Reinforcement Learning and Optimal Adaptive Control for Structured Dynamical Systems
2023 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis, we study the related problems of reinforcement learning and optimal adaptive control, specialized to specific classes of stochastic and structured dynamical systems. By stochastic, we mean systems that are unknown to the decision maker and evolve according to some probabilistic law. By structured, we mean that they are restricted in some known way, e.g., they belong to a specific model class or must obey a set of known constraints. The objective in both problems is the design of an optimal algorithm, i.e., one that maximizes a certain performance metric. Because of the stochasticity, the algorithm faces an exploration-exploitation dilemma, where it must balance collecting information from the system and leveraging existing information to choose the best action or input. This trade-off is best captured by the notion of regret, defined as the difference between the performance of the algorithm and that of an oracle with full knowledge of the system.

In the first part of the thesis, we investigate systems that can be modeled as Markov Decision Processes (MDPs) and derive general asymptotic and problem-specific regret lower bounds for ergodic and deterministic MDPs. We make these bounds explicit for MDPs that: i) are ergodic and unstructured, ii) have Lipschitz transitions and rewards, and iii) are deterministic and satisfy a decoupling property. Furthermore, we propose Directed Exploration Learning (DEL), an algorithm that is valid for any ergodic MDP with any structure and whose regret upper bound matches the associated regret lower bounds, thus being truly optimal. For this algorithm, we present theoretical regret guarantees as well as a numerical demonstration that verifies its ability to exploit the underlying structure.

In the second part, we study systems with uncertain linear dynamics that are subject to additive disturbances as well as state and input constraints. We develop Self-Tuning Tube-based Model Predictive Control (STT-MPC), an adaptive and robust model predictive control algorithm which leverages a least-squares estimator as well as polytopic tubes to guarantee robust constraint satisfaction, recursive feasibility, and input-to-state stability. The algorithm also ensures persistence of excitation without compromising the system's asymptotic performance and with no increase in computational complexity. We also provide guarantees on the expected regret of STT-MPC, in the form of an upper bound whose rate explicitly depends on the chosen rate of excitation. The performance of the algorithm is demonstrated via a numerical example.
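
For readers unfamiliar with the regret notion used above, a minimal formal sketch in standard notation (these are textbook definitions, not formulas quoted from the thesis): if g^\star denotes the optimal long-run average reward (gain) achievable in the system, the regret of a learning algorithm \pi after T steps is

    R^\pi(T) = T \, g^\star - \mathbb{E}^\pi\!\left[ \sum_{t=1}^{T} r_t \right],

and the problem-specific lower bounds discussed above grow logarithmically in T. In the classical multi-armed bandit special case (Lai and Robbins, 1985), every uniformly good algorithm satisfies

    \liminf_{T \to \infty} \frac{R^\pi(T)}{\log T} \;\geq\; \sum_{a : \Delta_a > 0} \frac{\Delta_a}{\mathrm{KL}(\nu_a \,\|\, \nu_{a^\star})},

where \Delta_a is the mean-reward gap of arm a and \mathrm{KL} denotes the Kullback-Leibler divergence between the reward distributions of arm a and of the best arm. The MDP lower bounds derived in the thesis have an analogous problem-specific form; their exact constants are problem-dependent and stated in the respective chapters.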

Place, publisher, year, edition, pages
Stockholm: Kungliga Tekniska högskolan, 2023. p. 152
Series
TRITA-EECS-AVL ; 2023:67
Keywords
Reinforcement Learning, Adaptive Control, Dynamical Systems, Control Theory, Control Engineering
National Category
Control Engineering
Research subject
Electrical Engineering
Identifiers
URN: urn:nbn:se:kth:diva-337406
ISBN: 978-91-8040-712-0
Public defence
2023-10-23, Kollegiesalen, Brinellvägen 6, Stockholm, 14:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), WASP 66453
Note

QC 20231003

Available from: 2023-10-04. Created: 2023-10-02. Last updated: 2023-10-04. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Tranos, Damianos; Proutiere, Alexandre
