Publications (3 of 3)
Jedra, Y., Réveillard, W., Stojanovic, S. & Proutiere, A. (2024). Low-rank bandits via tight two-to-infinity singular subspace recovery. In: International Conference on Machine Learning, ICML 2024. Paper presented at the 41st International Conference on Machine Learning, ICML 2024, July 21-27, 2024, Vienna, Austria (pp. 21430-21485). ML Research Press.
2024 (English). In: International Conference on Machine Learning, ICML 2024, ML Research Press, 2024, p. 21430-21485. Conference paper, Published paper (Refereed)
Abstract [en]

We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair (i, j) ∈ [m] × [n] is selected, the learner observes a noisy sample of the (i, j)-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an ε-optimal policy with probability at least 1 - δ typically scales as (m + n)/ε² log(1/δ). Our regret minimization algorithm enjoys minimax guarantees typically scaling as r^{5/4}(m + n)^{3/4} √T, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly r(m + n) and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.
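
Below is a minimal Python sketch of the first phase on synthetic data, assuming illustrative matrix sizes, noise level, and dense uniform sampling rather than the paper's exact sampling scheme: repeated noisy observations of a low-rank reward matrix are averaged, a rank-r truncated SVD recovers the singular subspaces, and the row-wise (two-to-infinity) recovery error is measured after Procrustes alignment.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_to_infinity(A):
    """Two-to-infinity norm: the largest Euclidean norm of a row of A."""
    return np.max(np.linalg.norm(A, axis=1))

# Synthetic rank-r reward matrix (sizes and noise level are illustrative).
m, n, r, sigma = 50, 40, 3, 0.1
U0, _ = np.linalg.qr(rng.standard_normal((m, r)))
V0, _ = np.linalg.qr(rng.standard_normal((n, r)))
M = U0 @ np.diag([3.0, 2.0, 1.0]) @ V0.T

# Phase 1 (sketch): average repeated noisy observations of every entry,
# then take a rank-r truncated SVD of the entry-wise estimate.
reps = 20
M_hat = M + sigma * rng.standard_normal((reps, m, n)).mean(axis=0)
U, _, Vt = np.linalg.svd(M_hat, full_matrices=False)
U_hat = U[:, :r]                      # left subspace; the right one is Vt[:r].T

# Row-wise recovery error, up to the best orthogonal rotation (Procrustes
# alignment). The paper bounds exactly this two-to-infinity distance, which
# controls the misspecification of the phase-2 linear-bandit reformulation.
Q, _, Pt = np.linalg.svd(U_hat.T @ U0)
print("||U_hat R - U0||_{2,inf} =", two_to_infinity(U_hat @ (Q @ Pt) - U0))
```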

Place, publisher, year, edition, pages
ML Research Press, 2024
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-353951 (URN); 2-s2.0-85203793840 (Scopus ID)
Conference
41st International Conference on Machine Learning, ICML 2024, July 21-27, 2024, Vienna, Austria
Note

QC 20240926

Available from: 2024-09-25. Created: 2024-09-25. Last updated: 2024-09-26. Bibliographically approved.
Stojanovic, S., Jedra, Y. & Proutiere, A. (2024). Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation. In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024. Paper presented at the 38th Conference on Neural Information Processing Systems, NeurIPS 2024, December 9-15, 2024, Vancouver, Canada. Neural Information Processing Systems Foundation.
2024 (English). In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Neural Information Processing Systems Foundation, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

We consider the problem of learning an ε-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that, remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an ε-optimal policy using Õ((S + A)/(poly(1 - γ)ε²)) samples, where S (resp. A) denotes the number of states (resp. actions) and γ the discount factor. Our algorithm achieves this order-optimal (in S, A and ε) sample complexity under milder conditions than those assumed in previously proposed approaches.
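
The following sketch conveys the two-phase leveraged estimation procedure on a fixed synthetic rank-r matrix standing in for the (state, action) value function of the current policy. The sizes, noise model, number of retained rows and columns, and helper names (sample, pinv_rank) are hypothetical choices made for illustration, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical rank-r (state, action) value matrix; in LoRa-PI this would
# be the value matrix of the current policy, observed through rollouts.
S, A, r, sigma = 60, 30, 2, 0.05
M = rng.standard_normal((S, r)) @ rng.standard_normal((r, A))

def sample(i, j, reps=25):
    """Average `reps` noisy observations of entry (i, j)."""
    return M[i, j] + sigma * rng.standard_normal(reps).mean()

def pinv_rank(W, r):
    """Rank-r pseudo-inverse via truncated SVD, as used in CUR methods."""
    u, s, vt = np.linalg.svd(W)
    return vt[:r].T @ np.diag(1.0 / s[:r]) @ u[:, :r].T

# Phase 1 (sketch): sample entries uniformly (densely here, for simplicity)
# and estimate row/column leverage scores from a rank-r spectral estimate.
M0 = np.array([[sample(i, j) for j in range(A)] for i in range(S)])
U, _, Vt = np.linalg.svd(M0, full_matrices=False)
row_lev = (U[:, :r] ** 2).sum(axis=1)
col_lev = (Vt[:r, :] ** 2).sum(axis=0)

# Phase 2: resample only a few high-leverage rows and columns, then complete
# the matrix with a CUR-like reconstruction M ~ C @ pinv(W) @ R.
rows = np.argsort(row_lev)[-2 * r:]
cols = np.argsort(col_lev)[-2 * r:]
C = np.array([[sample(i, j) for j in cols] for i in range(S)])
R = np.array([[sample(i, j) for j in range(A)] for i in rows])
M_est = C @ pinv_rank(C[rows, :], r) @ R
print("max entry-wise error:", np.abs(M_est - M).max())
```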

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2024
National Category
Control Engineering; Signal Processing
Identifiers
urn:nbn:se:kth:diva-361953 (URN); 2-s2.0-105000534247 (Scopus ID)
Conference
38th Conference on Neural Information Processing Systems, NeurIPS 2024, December 9-15, 2024, Vancouver, Canada
Note

QC 20250404

Available from: 2025-04-03. Created: 2025-04-03. Last updated: 2025-09-22. Bibliographically approved.
Stojanovic, S., Jedra, Y. & Proutiere, A. (2023). Spectral Entry-wise Matrix Estimation for Low-Rank Reinforcement Learning. In: Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023. Paper presented at the 37th Conference on Neural Information Processing Systems, NeurIPS 2023, December 10-16, 2023, New Orleans, United States of America. Neural Information Processing Systems Foundation.
2023 (English). In: Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023, Neural Information Processing Systems Foundation, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

We study matrix estimation problems arising in reinforcement learning (RL) with low-rank structure. In low-rank bandits, the matrix to be recovered specifies the expected arm rewards, and for low-rank Markov Decision Processes (MDPs), it may for example characterize the transition kernel of the MDP. In both cases, each entry of the matrix carries important information, and we seek estimation methods with low entry-wise error. Importantly, these methods further need to accommodate inherent correlations in the available data (e.g., for MDPs, the data consists of system trajectories). We investigate the performance of simple spectral-based matrix estimation approaches: we show that they efficiently recover the singular subspaces of the matrix and exhibit nearly minimal entry-wise error. These new results on low-rank matrix estimation make it possible to devise reinforcement learning algorithms that fully exploit the underlying low-rank structure. We provide two examples of such algorithms: a regret minimization algorithm for low-rank bandit problems, and a best policy identification algorithm for reward-free RL in low-rank MDPs. Both algorithms yield state-of-the-art performance guarantees.
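
The toy sketch below illustrates the flavor of these guarantees, assuming a small synthetic low-rank Markov chain, a single trajectory of illustrative length, and rank-r truncation of the empirical transition matrix as the spectral estimator; none of these choices are taken from the paper. The samples come from one trajectory, so they are correlated rather than i.i.d., matching the setting described above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy low-rank transition kernel: row-normalizing a rank-r product keeps
# the rank at most r (D^{-1} F G is still rank <= r).
S, r = 40, 3
P = rng.random((S, r)) @ rng.random((r, S))
P /= P.sum(axis=1, keepdims=True)

# Collect a single trajectory: successive samples are correlated.
T, x = 100_000, 0
counts = np.zeros((S, S))
for _ in range(T):
    y = rng.choice(S, p=P[x])
    counts[x, y] += 1
    x = y
P_emp = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

# Spectral estimator (sketch): truncate the empirical matrix to rank r
# and compare entry-wise (max-norm) errors against the raw estimate.
U, s, Vt = np.linalg.svd(P_emp, full_matrices=False)
P_spec = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]
print("entry-wise error, empirical estimate:", np.abs(P_emp - P).max())
print("entry-wise error, rank-r truncation :", np.abs(P_spec - P).max())
```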

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2023
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-349882 (URN); 2-s2.0-85187453998 (Scopus ID)
Conference
37th Conference on Neural Information Processing Systems, NeurIPS 2023, December 10-16, 2023, New Orleans, United States of America
Note

QC 20240704

Available from: 2024-07-04. Created: 2024-07-04. Last updated: 2024-07-04. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0001-5779-1649
