kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Navigating to the Best Policy in Markov Decision Processes
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).ORCID iD: 0000-0002-4679-4673
2021 (English)In: Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation (NIPS) , 2021, p. 25852-25864Conference paper, Published paper (Refereed)
Abstract [en]

We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least 1 - δ. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the generative setting [MP21], where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the navigation constraints, induced by the online setting. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation (NIPS) , 2021. p. 25852-25864
Keywords [en]
Average numbers, Ergodic theorem, Ergodicity, Fast convergence, Low bound, Markov Decision Processes, Non-homogeneous, On-line setting, Sample complexity, System trajectory
National Category
Mathematical Analysis
Identifiers
URN: urn:nbn:se:kth:diva-316207ISI: 000922928208053Scopus ID: 2-s2.0-85128435560OAI: oai:DiVA.org:kth-316207DiVA, id: diva2:1686766
Conference
35th Conference on Neural Information Processing Systems, NeurIPS 2021, 6-14 December 2021
Note

Part of proceedings: ISBN 978-1-7138-4539-3

QC 20220811

Available from: 2022-08-11 Created: 2022-08-11 Last updated: 2023-09-22Bibliographically approved

Open Access in DiVA

No full text in DiVA

Scopus

Authority records

Proutiere, Alexandre

Search in DiVA

By author/editor
Proutiere, Alexandre
By organisation
Decision and Control Systems (Automatic Control)
Mathematical Analysis

Search outside of DiVA

GoogleGoogle Scholar

urn-nbn

Altmetric score

urn-nbn
Total: 152 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf