Fundamental Limits in Stochastic Bandits
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). ORCID iD: 0000-0002-4617-8862
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis contributes to the field of stochastic bandits by exploring the fundamental limits (information-theoretic lower bounds) of three prevalent objectives in reinforcement learning applications, through a collection of five distinct papers. Each paper presents a novel perspective under a specific structure and scenario. The first paper investigates regret minimization in decentralized multi-agent multi-armed bandits. The second and third papers study pure exploration with fixed confidence in a broad spectrum of structured bandits. The last two papers offer new insights into best arm identification with a fixed budget.

The first paper addresses two popular scenarios in the decentralized multi-agent setting, one involving collisions and one without. For each of them, we propose an instance-specific optimal algorithm. Interestingly, our results show that the fundamental limits match those of the centralized analogue. The second paper introduces a simple but versatile algorithm, Frank-Wolfe Sampling, which achieves instance-specific optimality across a wide collection of pure exploration problems in structured bandits. Numerical results, both ours and those of subsequent studies, demonstrate the strong empirical performance of the algorithm on various pure exploration problems. However, Frank-Wolfe Sampling is not computationally efficient when the number of arms is extremely large. To address this issue, the third paper introduces Perturbed Frank-Wolfe Sampling, which can be implemented in polynomial time while retaining the instance-specific minimal sample complexity in combinatorial semi-bandits.

Unlike the sample complexity and regret objectives discussed above, characterizing the fundamental limit on the error probability in best arm identification with a fixed budget remains a challenge. The fourth paper addresses this challenge in two-armed bandits, introducing a new class of algorithms, stable algorithms, which encompasses a broad range of reasonable algorithms. We demonstrate that no consistent and stable algorithm surpasses the algorithm that samples each arm evenly, answering open problems formulated in prior work. For general multi-armed bandits, the final paper of this thesis presents, to our knowledge, the first large deviation theorem applicable to generic adaptive algorithms. Based on this result, we provide an exact analysis of the celebrated Successive Rejects algorithm. Furthermore, this new large deviation technique allows us to devise and analyze a new adaptive algorithm that is, to the best of our knowledge, the current state of the art. Overall, this thesis provides new insights for deriving fundamental limits in various online stochastic learning problems, an understanding that guides the development of more efficient algorithms and systems.


Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2024, p. 37
Series
TRITA-EECS-AVL ; 2024:64
Keywords [en]
Stochastic bandits, Reinforcement learning
National Category
Engineering and Technology
Identifiers
URN: urn:nbn:se:kth:diva-353342
ISBN: 978-91-8106-033-1 (print)
OAI: oai:DiVA.org:kth-353342
DiVA, id: diva2:1898589
Public defence
2024-10-11, https://kth-se.zoom.us/j/61569027581, D37, Lindstedtsvägen 9, Stockholm, 10:00 (English)
Opponent
Supervisors
Note

QC 20240918

Available from: 2024-09-18 Created: 2024-09-17 Last updated: 2024-09-23 Bibliographically approved
List of papers
1. Optimal Algorithms for Multiplayer Multi-Armed Bandits
2020 (English) In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR / [ed] Chiappa, S., Calandra, R., ML Research Press, 2020. Conference paper, Published paper (Refereed)
Abstract [en]

The paper addresses various Multiplayer Multi-Armed Bandit (MMAB) problems, where M decision-makers, or players, collaborate to maximize their cumulative reward. We first investigate the MMAB problem where players selecting the same arm experience a collision (and are aware of it) and do not collect any reward. For this problem, we present DPE1 (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same asymptotic regret as that obtained by an optimal centralized algorithm. DPE1 is simpler than the state-of-the-art algorithm SIC-MMAB of Boursier and Perchet (2019), and yet offers better performance guarantees. We then study the MMAB problem without collision, where players may select the same arm. Players sit on the vertices of a graph, and in each round they are able to send a message to their neighbours in the graph. We present DPE2, a simple and asymptotically optimal algorithm that outperforms the state-of-the-art algorithm DD-UCB of Martinez-Rubio et al. (2019). Moreover, under DPE2, the expected number of bits transmitted by the players in the graph is finite.
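The collision model described above can be illustrated with a minimal simulation. This is only a sketch of the reward model (Bernoulli arms, zero reward on collision) as described in the abstract; the function name, arm means, and player choices are illustrative, not part of the paper.

```python
import random

def play_round(choices, means, rng):
    """One round of the collision model: players that select the same arm
    collide and collect zero reward; a lone player on an arm draws a
    Bernoulli reward with that arm's mean."""
    rewards = []
    for arm in choices:
        if choices.count(arm) > 1:        # collision: no reward for anyone on this arm
            rewards.append(0.0)
        else:                             # lone player: Bernoulli draw
            rewards.append(1.0 if rng.random() < means[arm] else 0.0)
    return rewards

# Players 0 and 1 collide on arm 0; player 2 plays arm 1 alone (mean 1.0).
rng = random.Random(0)
print(play_round([0, 0, 1], [1.0, 1.0, 0.5], rng))  # → [0.0, 0.0, 1.0]
```

DPE1's contribution is the decentralized coordination that avoids such collisions with vanishing overhead; the sketch above only fixes the environment the algorithm plays against.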

Place, publisher, year, edition, pages
ML Research Press, 2020
Series
Proceedings of Machine Learning Research, ISSN 2640-3498 ; 108
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-285690 (URN)
000559931303071 ()
2-s2.0-85161889600 (Scopus ID)
Conference
23rd International Conference on Artificial Intelligence and Statistics (AISTATS), 26-28 August 2020, online
Note

QC 20210310

Available from: 2021-03-10 Created: 2021-03-10 Last updated: 2024-09-18 Bibliographically approved
2. Fast Pure Exploration via Frank-Wolfe
2021 (English) In: Advances in Neural Information Processing Systems, Neural Information Processing Systems Foundation, 2021, p. 5810-5821. Conference paper, Published paper (Refereed)
Abstract [en]

We study the problem of active pure exploration with fixed confidence in generic stochastic bandit environments. The goal of the learner is to answer a query about the environment with a given level of certainty while minimizing her sampling budget. For this problem, instance-specific lower bounds on the expected sample complexity reveal the optimal proportions of arm draws an Oracle algorithm would apply. These proportions solve an optimization problem whose tractability strongly depends on the structural properties of the environment, but may be instrumental in the design of efficient learning algorithms. We devise Frank-Wolfe-based Sampling (FWS), a simple algorithm whose sample complexity matches the lower bounds for a wide class of pure exploration problems. The algorithm is computationally efficient as, to learn and track the optimal proportions of arm draws, it relies on a single iteration of the Frank-Wolfe algorithm applied to the lower-bound optimization problem. We apply FWS to various pure exploration tasks, including best arm identification in unstructured, thresholded, linear, and Lipschitz bandits. Despite its simplicity, FWS is competitive compared to state-of-the-art algorithms.
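The core computational step of FWS, a single Frank-Wolfe iteration over the simplex of sampling proportions, can be sketched as follows. The quadratic toy objective and the target proportions `p` are stand-ins for the actual lower-bound optimization problem, which depends on the bandit instance; only the iteration mechanics are the point here.

```python
def frank_wolfe_step(w, grad, t):
    """One Frank-Wolfe iteration over the probability simplex: move toward
    the vertex that maximizes the linearized objective."""
    j = max(range(len(w)), key=lambda i: grad[i])  # linear maximization oracle
    gamma = 2.0 / (t + 2)                          # standard diminishing step size
    return [(1 - gamma) * wi + (gamma if i == j else 0.0)
            for i, wi in enumerate(w)]

# Toy stand-in for the lower-bound problem: drive the sampling proportions w
# toward hypothetical oracle proportions p by maximizing F(w) = -||w - p||^2.
p = [0.5, 0.3, 0.2]
w = [1 / 3, 1 / 3, 1 / 3]
for t in range(2000):
    grad = [-2.0 * (wi - pi) for wi, pi in zip(w, p)]
    w = frank_wolfe_step(w, grad, t)
# w now approximates p while remaining a valid point of the simplex.
```

Because each iterate is a convex combination of simplex vertices, the tracked proportions are always valid sampling fractions; in FWS itself the gradient would come from the current empirical estimate of the lower-bound problem rather than a fixed target.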

Place, publisher, year, edition, pages
Neural information processing systems foundation, 2021
Keywords
Budget control, Computational complexity, Learning algorithms, Optimization, Query processing, Stochastic systems, Computationally efficient, Frank-Wolfe algorithms, Lower bounds, Optimization problems, Problem instances, Sample complexity, Simple algorithms, Iterative methods
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-316180 (URN)
000922928201028 ()
2-s2.0-85131830598 (Scopus ID)
Conference
35th Conference on Neural Information Processing Systems, NeurIPS 2021, 6 December - 14 December 2021, Virtual/Online
Note

Part of proceedings: ISBN 978-1-7138-4539-3

QC 20230921

Available from: 2022-09-27 Created: 2022-09-27 Last updated: 2024-09-18 Bibliographically approved
3. Closing the Computational-Statistical Gap in Best Arm Identification for Combinatorial Semi-bandits
2023 (English) In: Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023, Neural Information Processing Systems Foundation, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

We study the best arm identification problem in combinatorial semi-bandits in the fixed confidence setting. We present Perturbed Frank-Wolfe Sampling (P-FWS), an algorithm that (i) runs in polynomial time, (ii) achieves the instance-specific minimal sample complexity in the high confidence regime, and (iii) enjoys polynomial sample complexity guarantees in the moderate confidence regime. To the best of our knowledge, even for vanilla bandit problems, no previous algorithm was able to achieve (ii) and (iii) simultaneously. With P-FWS, we close the computational-statistical gap in best arm identification in combinatorial semi-bandits. The design of P-FWS starts from the optimization problem that defines the information-theoretic and instance-specific sample complexity lower bound. P-FWS solves this problem in an online manner using, in each round, a single iteration of the Frank-Wolfe algorithm. Structural properties of the problem are leveraged to make the successive updates of P-FWS computationally efficient. In turn, P-FWS relies only on a simple linear maximization oracle.

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2023
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-350250 (URN)
001220600003008 ()
2-s2.0-85185584821 (Scopus ID)
Conference
37th Conference on Neural Information Processing Systems, NeurIPS 2023, New Orleans, United States of America, Dec 10 2023 - Dec 16 2023
Note

QC 20240710

Available from: 2024-07-10 Created: 2024-07-10 Last updated: 2024-09-18 Bibliographically approved
4. On Universally Optimal Algorithms for A/B Testing
2024 (English) In: Proceedings of the 41st International Conference on Machine Learning, International Conference on Machine Learning, 2024, Vol. 235, p. 50065-50091. Conference paper, Published paper (Refereed)
Abstract [en]

We study the problem of best-arm identification with fixed budget in stochastic multi-armed bandits with Bernoulli rewards. For the problem with two arms, also known as the A/B testing problem, we prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (referred to as the uniform sampling algorithm) in all instances, and that (ii) strictly outperforms uniform sampling on at least one instance. In short, there is no algorithm better than the uniform sampling algorithm. To establish this result, we first introduce the natural class of consistent and stable algorithms, and show that any algorithm that performs as well as the uniform sampling algorithm in all instances belongs to this class. The proof then proceeds by deriving a lower bound on the error rate satisfied by any consistent and stable algorithm, and by showing that the uniform sampling algorithm matches this lower bound. Our results provide a solution to the two open problems presented in (Qin, 2022). For the general problem with more than two arms, we provide a first set of results. We characterize the asymptotic error rate of the celebrated Successive Rejects (SR) algorithm (Audibert et al., 2010) and show that, surprisingly, the uniform sampling algorithm outperforms the SR algorithm in some instances.
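The uniform sampling algorithm that the paper proves unimprovable (among consistent and stable algorithms) is simple to state: split the budget evenly between the two arms and recommend the one with the higher empirical mean. A minimal sketch with Bernoulli rewards, where the function name, arm means, and budget are illustrative:

```python
import random

def uniform_ab_test(p_a, p_b, budget, rng):
    """Fixed-budget A/B testing via uniform sampling: pull each Bernoulli
    arm budget // 2 times and return the arm with the higher empirical mean."""
    half = budget // 2
    wins_a = sum(rng.random() < p_a for _ in range(half))
    wins_b = sum(rng.random() < p_b for _ in range(half))
    return "A" if wins_a >= wins_b else "B"

print(uniform_ab_test(0.9, 0.1, 2000, random.Random(0)))  # → A
```

Note the precise shape of the result: no consistent and stable algorithm can beat this rule on every instance, which is not the same as uniform sampling being the best choice on any single given instance.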

Place, publisher, year, edition, pages
International Conference on Machine Learning, 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-352890 (URN)
Conference
International Conference on Machine Learning, 21-27 July 2024, Vienna, Austria
Note

QC 20240910

Available from: 2024-09-09 Created: 2024-09-09 Last updated: 2024-09-18
5. Best Arm Identification with Fixed Budget: A Large Deviation Perspective
2023 (English) In: Advances in Neural Information Processing Systems 36 - 37th Conference on Neural Information Processing Systems, NeurIPS 2023, Neural Information Processing Systems Foundation, 2023, p. 16804-16815. Conference paper, Published paper (Refereed)
Abstract [en]

We consider the problem of identifying the best arm in stochastic Multi-Armed Bandits (MABs) using a fixed sampling budget. Characterizing the minimal instance-specific error probability for this problem constitutes one of the important remaining open problems in MABs. When arms are selected using a static sampling strategy, the error probability decays exponentially with the number of samples at a rate that can be explicitly derived via Large Deviation techniques. Analyzing the performance of algorithms with adaptive sampling strategies is however much more challenging. In this paper, we establish a connection between the Large Deviation Principle (LDP) satisfied by the empirical proportions of arm draws and that satisfied by the empirical arm rewards. This connection holds for any adaptive algorithm, and is leveraged (i) to improve error probability upper bounds of some existing algorithms, such as the celebrated SR (Successive Rejects) algorithm (Audibert et al., 2010), and (ii) to devise and analyze new algorithms. In particular, we present CR (Continuous Rejects), a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms. Applying our Large Deviation results, we prove that CR enjoys better performance guarantees than existing algorithms, including SR. Extensive numerical experiments confirm this observation.
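Successive Rejects, the algorithm whose error exponent the paper's Large Deviation results sharpen, eliminates one arm per phase with phase lengths set by the budget. A sketch following the schedule of Audibert et al. (2010), under an assumed `pull(arm)` sampling interface; the interface and names are illustrative:

```python
import math

def successive_rejects(pull, n_arms, budget):
    """Fixed-budget best arm identification: K - 1 phases, each ending with
    the rejection of the arm with the lowest empirical mean."""
    K = n_arms
    # Normalization constant from the SR phase-length schedule.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    counts = [0] * K
    sums = [0.0] * K
    n_prev = 0
    for k in range(1, K):
        # Cumulative number of pulls per surviving arm after phase k.
        n_k = math.ceil((budget - K) / (log_bar * (K + 1 - k)))
        for arm in active:
            for _ in range(n_k - n_prev):
                sums[arm] += pull(arm)
                counts[arm] += 1
        n_prev = n_k
        # Reject the empirically worst surviving arm.
        active.remove(min(active, key=lambda a: sums[a] / counts[a]))
    return active[0]

# Deterministic sanity check: rewards equal to fixed means.
print(successive_rejects(lambda a: [0.1, 0.2, 0.3, 0.9][a], 4, 200))  # → 3
```

CR (Continuous Rejects), the paper's new algorithm, replaces the fixed phase boundaries above with rejection decisions allowed in any round based on observed empirical gaps; the sketch shows only the SR baseline it is compared against.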

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2023
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-349554 (URN)
2-s2.0-85190785605 (Scopus ID)
Conference
37th Conference on Neural Information Processing Systems, NeurIPS 2023, New Orleans, United States of America, Dec 10 2023 - Dec 16 2023
Note

QC 20240702

Available from: 2024-07-02 Created: 2024-07-02 Last updated: 2024-09-18 Bibliographically approved

Open Access in DiVA

Kappa (1230 kB), 7 downloads
File name: SUMMARY01.pdf
File size: 1230 kB
Checksum (SHA-512): a607140235e50972c64406ca46a53a8f77ad9595420d278c6ee641ce1fbbbfc24e2775966728a3f070196ceb6658a6416a4724fecb956dcbcec93ccb9b3eb063
Type: summary; Mimetype: application/pdf

Full thesis (48101 kB), 9 downloads
File name: FULLTEXT01.pdf
File size: 48101 kB
Checksum (SHA-512): 78986551d2145e61d2193db28b791e0465ed746a2ad7a44aa60abc9382d111e99a4b3e5ecbcd7185ed3a488fa5471292b4e11a49fa5cce50ee6e18f78f778dc4
Type: fulltext; Mimetype: application/pdf

Search in DiVA

By author/editor
Wang, Po-An
By organisation
Decision and Control Systems (Automatic Control)
Engineering and Technology

