Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Automatic Control. ORCID iD: 0000-0001-5779-1649
MIT, Cambridge, USA.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Automatic Control. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Digital Futures. ORCID iD: 0000-0002-4679-4673
2024 (English). In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024. Neural Information Processing Systems Foundation, 2024. Conference paper, published paper (refereed)
Abstract [en]

We consider the problem of learning an ε-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an ε-optimal policy using Õ((S+A)/(poly(1-γ)ε²)) samples, where S (resp. A) denotes the number of states (resp. actions) and γ the discount factor. Our algorithm achieves this order-optimal (in S, A and ε) sample complexity under milder conditions than those assumed in previously proposed approaches.
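The two-phase procedure summarized in the abstract can be sketched as follows. This is an illustrative toy sketch only, not the authors' LoRa-PI implementation: it estimates row and column leverage scores from a truncated SVD, keeps a few high-leverage rows and columns, and completes the matrix with a CUR-like formula. The matrix sizes and rank are hypothetical, and the matrix is noiseless for clarity (in the algorithm, the spectral step runs on uniformly sampled entries).

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth rank-r "value" matrix over S states and A actions
# (hypothetical sizes, noiseless for clarity).
S, A, r = 50, 40, 3
M = rng.standard_normal((S, r)) @ rng.standard_normal((r, A))

# Phase 1: spectral estimate of leverage scores. Here we use M itself;
# the algorithm instead uses a spectral estimate built from uniformly
# sampled entries.
U, _, Vt = np.linalg.svd(M, full_matrices=False)
row_lev = np.sum(U[:, :r] ** 2, axis=1)   # row leverage scores
col_lev = np.sum(Vt[:r, :] ** 2, axis=0)  # column leverage scores

# Phase 2: keep a few high-leverage rows/columns and sample their
# entries densely.
rows = np.argsort(row_lev)[-2 * r:]
cols = np.argsort(col_lev)[-2 * r:]
C = M[:, cols]                 # selected columns
R = M[rows, :]                 # selected rows
W = M[np.ix_(rows, cols)]      # their intersection

# CUR-like completion: M_hat = C @ pinv(W) @ R. In the noiseless
# low-rank case this recovers M exactly.
M_hat = C @ np.linalg.pinv(W) @ R
rel_err = np.linalg.norm(M - M_hat) / np.linalg.norm(M)
```

With noisy samples, the paper's entry-wise guarantees for this step depend on the spikiness of the matrix rather than on its coherence.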

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2024.
National subject category
Control Engineering; Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-361953
Scopus ID: 2-s2.0-105000534247
OAI: oai:DiVA.org:kth-361953
DiVA id: diva2:1949626
Conference
38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, December 9-15, 2024
Note

QC 20250404

Available from: 2025-04-03 Created: 2025-04-03 Last updated: 2025-09-22 Bibliographically approved

Open Access in DiVA

Full text not available in DiVA

Scopus

Authors

Stojanovic, Stefan; Proutiere, Alexandre
