Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Automatic Control. ORCID iD: 0000-0001-5779-1649
MIT, Cambridge, USA.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Automatic Control. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Digital Futures. ORCID iD: 0000-0002-4679-4673
2024 (English) In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Neural Information Processing Systems Foundation, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

We consider the problem of learning an ε-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an ε-optimal policy using Õ((S + A)/(poly(1 − γ)ε²)) samples, where S (resp. A) denotes the number of states (resp. actions) and γ the discount factor. Our algorithm achieves this order-optimal (in S, A and ε) sample complexity under milder conditions than those assumed in previously proposed approaches.
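
The two-phase, leverage-score-based estimation described in the abstract can be sketched in a few lines. The sketch below is an illustrative reconstruction from the abstract alone, not the authors' implementation: the entry-sampling oracle `sample_entry`, the sampling rate `p_uniform`, the number of selected rows/columns `n_anchor`, and the way leverage scores are turned into row/column selections are all assumptions, and the policy-iteration loop of LoRa-PI is omitted.

```python
import numpy as np

def leveraged_cur_estimate(sample_entry, n_rows, n_cols, rank,
                           p_uniform=0.3, n_anchor=None, rng=None):
    """Estimate a low-rank matrix from noisy entry queries (CUR-style sketch).

    sample_entry(i, j) returns a noisy observation of M[i, j], e.g. a Monte
    Carlo estimate of the current policy's Q-value at (state i, action j).
    """
    rng = np.random.default_rng(rng)
    n_anchor = n_anchor if n_anchor is not None else 2 * rank

    # Phase 1: sample entries uniformly at random, then estimate row/column
    # leverage scores from the top singular vectors of the rescaled,
    # zero-filled observation matrix (a simple spectral method).
    mask = rng.random((n_rows, n_cols)) < p_uniform
    obs = np.zeros((n_rows, n_cols))
    idx = np.nonzero(mask)
    obs[idx] = [sample_entry(i, j) for i, j in zip(*idx)]
    U, _, Vt = np.linalg.svd(obs / p_uniform, full_matrices=False)
    row_scores = np.sum(U[:, :rank] ** 2, axis=1)   # row leverage scores
    col_scores = np.sum(Vt[:rank, :] ** 2, axis=0)  # column leverage scores

    # Phase 2: re-sample a few high-leverage rows and columns and complete
    # the matrix with the CUR formula  M_hat = C @ pinv(W) @ R.
    rows = np.argsort(row_scores)[-n_anchor:]
    cols = np.argsort(col_scores)[-n_anchor:]
    C = np.array([[sample_entry(i, j) for j in cols] for i in range(n_rows)])
    R = np.array([[sample_entry(i, j) for j in range(n_cols)] for i in rows])
    W = C[rows]                      # intersection of selected rows/columns
    return C @ np.linalg.pinv(W) @ R
```

With exact entry queries of a rank-r matrix, a CUR estimate of this kind recovers the matrix whenever the selected rows and columns span its row and column spaces; the paper's contribution is the entry-wise error guarantee under noisy queries, stated in terms of the matrix's spikiness rather than its coherence.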

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2024.
HSV category
Identifiers
URN: urn:nbn:se:kth:diva-361953
Scopus ID: 2-s2.0-105000534247
OAI: oai:DiVA.org:kth-361953
DiVA, id: diva2:1949626
Conference
38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, December 9-15, 2024
Note

QC 20250404

Available from: 2025-04-03 Created: 2025-04-03 Last updated: 2025-09-22 Bibliographically approved

Open Access in DiVA

Fulltext is missing in DiVA

Scopus

Person

Stojanovic, Stefan; Proutiere, Alexandre
