An Information-Theoretic Approach to Bandits and Reinforcement Learning
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Information Science and Engineering. ORCID iD: 0000-0003-2598-4459
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In this thesis, we develop and extend information-theoretic methods for analyzing reinforcement learning, focusing in particular on bandit problems. Our aim is to understand how learning algorithms manage the fundamental trade-off between exploration and exploitation, and how information-theoretic quantities such as entropy and mutual information can be used to characterize and bound their performance. Building on the framework of Russo and Van Roy, which relates the Bayesian regret of an algorithm to its learning efficiency, measured by the information ratio, we derive new results that broaden the scope of this approach to more complex and structured learning settings.
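For context, the two quantities at the heart of this framework can be sketched as follows. This is a schematic rendering of the standard definitions from Russo and Van Roy; the notation ($A^*$, $A_t$, $Y_t$, $\Gamma_t$) is chosen here for illustration and is not taken from the thesis itself.

```latex
% Information ratio at round t: squared expected immediate regret
% divided by the information gained about the optimal action A^*
% from the chosen action A_t and observation Y_t.
\Gamma_t \;=\;
  \frac{\big(\mathbb{E}_t[\,r(A^*) - r(A_t)\,]\big)^2}
       {I_t\big(A^*;\,(A_t, Y_t)\big)}

% If \Gamma_t \le \bar{\Gamma} uniformly over rounds, the Bayesian
% regret over horizon T is controlled by the entropy of A^*:
\mathbb{E}\!\left[\sum_{t=1}^{T} \big(r(A^*) - r(A_t)\big)\right]
  \;\le\; \sqrt{\bar{\Gamma}\; H(A^*)\; T}.
```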

The first part of the thesis revisits the analysis of linear bandits, a model in which rewards depend linearly on the selected action via an unknown parameter. While existing information-theoretic analyses provide general regret bounds, they all assume finite action sets. We close this gap by introducing refined analyses and showing that near-optimal regret bounds can be obtained even for infinite or continuous action spaces.

The second part of the thesis extends the information-theoretic analysis framework to more complex bandit models. In the contextual bandit setting, we build on the lifted information ratio and extend its application to more general reward classes. Additionally, we provide alternative proofs based on classical information-theoretic tools. This new derivation simplifies the analysis and enables sharper guarantees for structured problems. We then study the Thompson Sampling algorithm for logistic bandits, a widely used model for binary rewards. We derive the first regret bounds that avoid exponential dependence on the logistic slope parameter and are independent of the number of actions, resolving a long-standing open question in the field.

The third part of the thesis takes a step beyond bandit models and investigates how information-theoretic principles can be extended to more general reinforcement learning problems. We study the theoretical performance limits and learning guarantees of model-based reinforcement learning, establishing a framework for analyzing its Bayesian regret. In parallel, we derive PAC-Bayesian bounds for offline decision-making, including improved guarantees for offline bandits that are parameter-free and achieve near-optimal rates.

Together, the contributions of this thesis provide a coherent extension of information-theoretic analysis to a wide range of learning settings: from linear to nonlinear, from discrete to continuous, and from bandits to reinforcement learning.


Place, publisher, year, edition, pages
Stockholm: Kungliga Tekniska högskolan, 2025, p. xi, 45
Series
TRITA-EECS-AVL ; 2025:108
Keywords [en]
Reinforcement learning, Bandit problems, Information theory, Regret analysis, Thompson Sampling, PAC-Bayes bounds.
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
URN: urn:nbn:se:kth:diva-373221
ISBN: 978-91-8106-474-2 (print)
OAI: oai:DiVA.org:kth-373221
DiVA, id: diva2:2015738
Public defence
2025-12-15, https://kth-se.zoom.us/j/61173029059, F3, Flodis, Lindstedtvägen 26, Stockholm, 13:15 (English)
Opponent
Supervisors
Note

QC 20251124

Available from: 2025-11-24. Created: 2025-11-21. Last updated: 2025-11-24. Bibliographically approved
List of papers
1. An Information-Theoretic Analysis of Thompson Sampling with Infinite Action Spaces
2025 (English). In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) / [ed] IEEE, Institute of Electrical and Electronics Engineers (IEEE), 2025. Conference paper, Published paper (Refereed)
Abstract [en]

This paper studies the Bayesian regret of the Thompson Sampling algorithm for bandit problems, building on the information-theoretic framework introduced by Russo and Van Roy [1]. Specifically, it extends the rate-distortion analysis of Dong and Van Roy [2], which provides near-optimal bounds for linear bandits. A key limitation of these results is the assumption of a finite action space. We address this by extending the analysis to settings with infinite and continuous action spaces. Additionally, we specialize our results to bandit problems with expected rewards that are Lipschitz continuous with respect to the action space, deriving a regret bound that explicitly accounts for the complexity of the action space.
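As a concrete illustration of the setting analyzed in this paper, the sketch below runs Thompson Sampling on a toy linear-Gaussian bandit with a finite action set (finite for simplicity, whereas the paper's contribution concerns infinite and continuous action spaces). This is a minimal hypothetical example: the function name, Gaussian prior, and noise model are our choices, not the paper's code.

```python
import numpy as np

def thompson_sampling_linear(actions, theta_true, T, noise_std=0.1,
                             prior_var=1.0, seed=0):
    """Illustrative Thompson Sampling for a linear-Gaussian bandit.

    Rewards: r_t = <a_t, theta> + Gaussian noise. The Gaussian prior on
    theta is conjugate, so the posterior is updated in closed form.
    Returns the cumulative regret against the best fixed action.
    """
    rng = np.random.default_rng(seed)
    d = actions.shape[1]
    # Prior N(0, prior_var * I), tracked via precision matrix and
    # the precision-weighted sum of observations.
    precision = np.eye(d) / prior_var
    b = np.zeros(d)
    opt = (actions @ theta_true).max()  # value of the best action
    regret = 0.0
    for _ in range(T):
        # Sample a plausible parameter from the current posterior.
        cov = np.linalg.inv(precision)
        mean = cov @ b
        theta_s = rng.multivariate_normal(mean, cov)
        # Act greedily with respect to the sampled parameter.
        a = actions[np.argmax(actions @ theta_s)]
        r = a @ theta_true + noise_std * rng.normal()
        # Conjugate Gaussian posterior update.
        precision += np.outer(a, a) / noise_std**2
        b += a * r / noise_std**2
        regret += opt - a @ theta_true
    return regret
```

Because each per-step regret term is nonnegative and the posterior concentrates after a few pulls, the cumulative regret grows sublinearly in T in this toy setting.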

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-373210 (URN), 10.1109/ICASSP49660.2025.10888239 (DOI), 2-s2.0-105003867651 (Scopus ID)
Conference
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, April 6-11, 2025
Note

QC 20251124

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-24. Bibliographically approved
2. Chained Information-Theoretic Bounds and Tight Regret Rate for Linear Bandit Problems
2025 (English). Article in journal (Refereed), Submitted
Abstract [en]

This paper studies the Bayesian regret of a variant of the Thompson Sampling algorithm for bandit problems. It builds upon the information-theoretic framework of [1] and, more specifically, on the rate-distortion analysis from [2], which proved a bound with regret rate of O(d√(T log(T))) for the d-dimensional linear bandit setting. We focus on bandit problems with a metric action space and, using a chaining argument, we establish new bounds that depend on the metric entropy of the action space for a variant of Thompson Sampling. Under a suitable continuity assumption on the rewards, our bound offers a tight rate of O(d√T) for d-dimensional linear bandit problems.

National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-373211 (URN)
Note

QC 20251124

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-24. Bibliographically approved
3. Thompson Sampling Regret Bounds for Contextual Bandits with sub-Gaussian rewards
2023 (English). In: 2023 IEEE International Symposium on Information Theory, ISIT 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 1306-1311. Conference paper, Published paper (Refereed)
Abstract [en]

In this work, we study the performance of the Thompson Sampling algorithm for Contextual Bandit problems based on the framework introduced by [1] and their concept of lifted information ratio. First, we prove a comprehensive bound on the Thompson Sampling expected cumulative regret that depends on the mutual information between the environment parameters and the history. Then, we introduce new bounds on the lifted information ratio that hold for sub-Gaussian rewards, thus generalizing the results from [1], whose analysis requires binary rewards. Finally, we provide explicit regret bounds for the special cases of unstructured bounded contextual bandits, structured bounded contextual bandits with Laplace likelihood, structured Bernoulli bandits, and bounded linear contextual bandits.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Probability Theory and Statistics Computer Sciences
Identifiers
urn:nbn:se:kth:diva-337882 (URN), 10.1109/ISIT54713.2023.10206792 (DOI), 2-s2.0-85171469978 (Scopus ID)
Conference
2023 IEEE International Symposium on Information Theory, ISIT 2023, Taipei, Taiwan, Jun 25 2023 - Jun 30 2023
Note

Part of ISBN 9781665475549

QC 20231010

Available from: 2023-10-10. Created: 2023-10-10. Last updated: 2025-11-24. Bibliographically approved
4. An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems
2025 (English). Article in journal (Refereed), Submitted
Abstract [en]

We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, $\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle))$, with parameter $\beta>0$, and where both the action $a\in \mathcal{A}$ and the unknown parameter $\theta \in \mathcal{O}$ lie within the $d$-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo and Van Roy, we derive regret bounds via the analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred by the agent and the information it gained about the parameter $\theta$. We improve upon previous results and establish that the information ratio is bounded in $O\big(d \alpha^{-2} \big)$, where $d$ is the dimension of the problem and $\alpha$ is a \emph{minimax measure} of the alignment between the action space $\mathcal{A}$ and the parameter space $\mathcal{O}$. Notably, our bound does not scale exponentially with the logistic slope and is independent of the cardinality of the action and parameter spaces. Using this result, we derive a bound on the Thompson Sampling expected regret of order $O(d \alpha^{-1} \sqrt{T \log(\beta T/d)})$, where $T$ is the number of time steps. To our knowledge, this is the \emph{first regret bound for any logistic bandit algorithm} that eliminates any exponential scaling with $\beta$ and is independent of the number of actions. In particular, when the parameters are on the sphere and the action space contains the parameter space, the expected regret bound is of order $O(d \sqrt{T \log(\beta T/d)})$.

National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-373212 (URN)
Note

QC 20251124

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-30. Bibliographically approved
5. An Information-Theoretic Analysis of Bayesian Reinforcement Learning
2022 (English). In: 2022 58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Institute of Electrical and Electronics Engineers (IEEE), 2022. Conference paper, Published paper (Refereed)
Abstract [en]

Building on the framework introduced by Xu and Raginsky [1] for supervised learning problems, we study the best achievable performance for model-based Bayesian reinforcement learning problems. To this end, we define the minimum Bayesian regret (MBR) as the difference between the maximum expected cumulative reward obtainable with full knowledge of the environment and its dynamics, and that obtainable by learning from the collected data. We specialize this definition to reinforcement learning problems modeled as Markov decision processes (MDPs) whose kernel parameters are unknown to the agent and whose uncertainty is expressed by a prior distribution. One method for deriving upper bounds on the MBR is presented, and specific bounds based on the relative entropy and the Wasserstein distance are given. We then focus on two particular cases of MDPs, the multi-armed bandit problem (MAB) and the online optimization with partial feedback problem. For the latter problem, we show that our bounds can recover from below the current information-theoretic bounds by Russo and Van Roy [2].
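Schematically, and with notation of our own choosing (W for the unknown environment parameters, D for the collected data), the MBR described above contrasts an informed planner with the best data-driven learner:

```latex
% Minimum Bayesian regret: informed optimum minus the best
% achievable value when acting only on the collected data D.
\mathrm{MBR} \;=\;
  \mathbb{E}\Big[\sup_{\pi} \mathbb{E}\big[R_T(\pi) \,\big|\, W\big]\Big]
  \;-\;
  \sup_{\psi} \mathbb{E}\big[R_T(\psi(D))\big],
% where R_T denotes the cumulative reward over horizon T.
```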

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2022
Series
Annual Allerton Conference on Communication Control and Computing, ISSN 2474-0195
Keywords
information-theoretic bounds, Markov decision process, multi-armed bandit problem, reinforcement learning, Bayesian regret, mutual information, Wasserstein distance
National Category
Computer Sciences Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-323211 (URN), 10.1109/ALLERTON49937.2022.9929353 (DOI), 000895747000017 (), 2-s2.0-85142666730 (Scopus ID)
Conference
58th Annual Allerton Conference on Communication, Control, and Computing (Allerton), SEP 28-30, 2022, Monticello, IL
Note

Part of proceedings: ISBN 979-8-3503-9998-1, QC 20230130

Available from: 2023-01-30. Created: 2023-01-30. Last updated: 2025-11-24. Bibliographically approved
6. Refined PAC-Bayes Bounds for Offline Bandits
2025 (English). In: 2025 IEEE International Symposium on Information Theory (ISIT) / [ed] IEEE, Institute of Electrical and Electronics Engineers (IEEE), 2025. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from [1]–[4] and improve on their results using a new parameter optimization approach introduced by [5]. This technique is based on a discretization of the space of possible events to optimize the “in probability” parameter, λ. We provide two parameter-free PAC-Bayes bounds, one based on Hoeffding-Azuma’s inequality and the other based on Bernstein’s inequality. We prove that our bounds are almost optimal as they recover the same rate as would be obtained by setting the “in probability” parameter after the realization of the data.
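To situate this result, a standard fixed-λ PAC-Bayes bound of Hoeffding type reads as follows. This is a generic textbook template for losses bounded in [0, C], not the paper's refined bound; the contribution described above is precisely to remove the need to fix λ before seeing the data.

```latex
% With probability at least 1 - \delta over a sample of size n,
% simultaneously for all posteriors \rho over hypotheses h:
\mathbb{E}_{\rho}\big[R(h)\big] \;\le\;
  \mathbb{E}_{\rho}\big[\widehat{R}_n(h)\big]
  + \frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln(1/\delta)}{\lambda}
  + \frac{\lambda C^2}{8 n},
% valid for any \lambda > 0 fixed in advance; optimizing \lambda
% over a data-independent grid yields parameter-free variants.
```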

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-373216 (URN), 10.1109/ISIT63088.2025.11195415 (DOI), 2-s2.0-105021929724 (Scopus ID)
Conference
International Symposium on Information Theory (ISIT), Ann Arbor, MI, USA, June 22-27, 2025
Note

Part of ISBN 9798331543990

QC 20251124

Available from: 2025-11-21. Created: 2025-11-21. Last updated: 2025-11-24. Bibliographically approved

Open Access in DiVA

kappa (890 kB), 198 downloads
File information
File name: FULLTEXT01.pdf
File size: 890 kB
Checksum (SHA-512): f62f4f6d38f705e5a422a06d3216ba727035cd92894b7af7aaa32c0d881956e98c5f043118581a0c654dbc5dff06684ebdff98c677811d69c579574278181c60
Type: fulltext
Mimetype: application/pdf

Authority records

Gouverneur, Amaury

Search in DiVA

By author/editor
Gouverneur, Amaury
By organisation
Information Science and Engineering
Other Electrical Engineering, Electronic Engineering, Information Engineering

Search outside of DiVA

Google, Google Scholar
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.
