Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
Stojanovic, Stefan: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). ORCID iD: 0000-0001-5779-1649
MIT, Cambridge, USA.
Proutiere, Alexandre: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control); KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Digital futures. ORCID iD: 0000-0002-4679-4673
2024 (English). In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024. Neural Information Processing Systems Foundation, 2024. Conference paper, published paper (refereed).
Abstract [en]

We consider the problem of learning an ε-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an ε-optimal policy using Õ((S+A)/(poly(1-γ)ε²)) samples, where S (resp. A) denotes the number of states (resp. actions) and γ the discount factor. Our algorithm achieves this order-optimal (in S, A and ε) sample complexity under milder conditions than those assumed in previously proposed approaches.
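
The two-phase leveraged estimation step described in the abstract can be sketched in code. The Python snippet below is an illustrative sketch only, not the paper's exact procedure: the names (sample_entry, leveraged_cur_estimate), the plug-in leverage-score estimator (truncated SVD of a rescaled, zero-filled sample matrix), the anchor-selection rule, and all default parameters are assumptions made for this example.

```python
import numpy as np

def leveraged_cur_estimate(sample_entry, n_rows, n_cols, rank,
                           p_uniform=0.2, n_anchors=None, seed=0):
    """Sketch of a two-phase, leverage-score-guided, CUR-like matrix estimate.

    `sample_entry(i, j)` is assumed to return a (possibly noisy) observation of
    entry (i, j), e.g. a Monte-Carlo estimate of the current policy's value at
    (state i, action j). All names and defaults here are illustrative.
    """
    rng = np.random.default_rng(seed)
    n_anchors = n_anchors if n_anchors is not None else 2 * rank

    # Phase 1: sample entries uniformly at random; the rescaled, zero-filled
    # matrix is an entry-wise unbiased proxy for the full matrix.
    mask = rng.random((n_rows, n_cols)) < p_uniform
    M_obs = np.zeros((n_rows, n_cols))
    for i, j in zip(*np.nonzero(mask)):
        M_obs[i, j] = sample_entry(i, j) / p_uniform

    # Spectral step: a rank-r truncated SVD yields estimated leverage scores
    # of the rows and columns.
    U, _, Vt = np.linalg.svd(M_obs, full_matrices=False)
    row_lev = np.sum(U[:, :rank] ** 2, axis=1)
    col_lev = np.sum(Vt[:rank, :] ** 2, axis=0)

    # Phase 2: keep a few high-leverage anchor rows and columns and sample
    # their entries densely.
    rows_sel = np.argsort(row_lev)[-n_anchors:]
    cols_sel = np.argsort(col_lev)[-n_anchors:]
    C = np.array([[sample_entry(i, j) for j in cols_sel] for i in range(n_rows)])
    R = np.array([[sample_entry(i, j) for j in range(n_cols)] for i in rows_sel])
    W = R[:, cols_sel]  # intersection block, entries at rows_sel x cols_sel

    # CUR-like completion: for a rank-r matrix whose intersection block also
    # has rank r, M = C @ pinv(W) @ R holds exactly.
    return C @ np.linalg.pinv(W) @ R
```

With exact low-rank entries and enough anchors this CUR identity recovers the matrix exactly; with noisy samples, the paper's entry-wise guarantees for the leveraged procedure depend on the spikiness of the matrix rather than on its coherence.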

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2024.
National Category
Control Engineering; Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-361953
Scopus ID: 2-s2.0-105000534247
OAI: oai:DiVA.org:kth-361953
DiVA, id: diva2:1949626
Conference
38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, December 9-15, 2024
Note

QC 20250404

Available from: 2025-04-03. Created: 2025-04-03. Last updated: 2025-09-22. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Scopus

Authority records

Stojanovic, Stefan; Proutiere, Alexandre

Search in DiVA

By author/editor
Stojanovic, Stefan; Proutiere, Alexandre
By organisation
Decision and Control Systems (Automatic Control); Digital futures
Control Engineering; Signal Processing
