kth.sePublications KTH
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandit Problems
KTH, School of Electrical Engineering and Computer Science (EECS).ORCID iD: 0000-0003-2598-4459
KTH, School of Electrical Engineering and Computer Science (EECS), Information Science and Engineering.ORCID iD: 0000-0002-0036-9049
KTH, School of Electrical Engineering and Computer Science (EECS), Information Science and Engineering.ORCID iD: 0000-0002-7926-5081
2026 (English)In: Transactions on Machine Learning Research, E-ISSN 2835-8856, Vol. 2026-FebruaryArticle in journal (Refereed) Published
Abstract [en]

We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, exp(β[removed])/(1+ exp(β[removed])), with parameter β > 0, and both the action a ∈ A and the unknown parameter θ ∈ O lie within the d-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo & Van Roy (2016), we derive regret bounds via the analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred by the agent and the information it just gained about the parameter θ. We improve upon previous results and establish that the information ratio is bounded by d(4/α)2, where d is the dimension of the problem and α is a minimax measure of the alignment between the action space A and the parameter space O. Notably, our bound does not scale exponentially with the logistic slope and is independent of the cardinality of the action and parameter spaces. Using this result, we derive a bound on the Thompson Sampling expected regret of order (Formula Presented) where T is the number of time steps. To our knowledge, this is the first regret bound for any logistic bandit algorithm that avoids any exponential scaling with β and is independent of the number of actions. In particular, when the parameters are on the sphere and the action space contains the parameter space, the expected regret bound is of order (Formula Presented).

Place, publisher, year, edition, pages
Transactions on Machine Learning Research , 2026. Vol. 2026-February
National Category
Computer Sciences Probability Theory and Statistics Mathematical Analysis
Identifiers
URN: urn:nbn:se:kth:diva-378773Scopus ID: 2-s2.0-105032794057OAI: oai:DiVA.org:kth-378773DiVA, id: diva2:2049526
Note

Not duplicate with DiVA 2015716

QC 20260330

Available from: 2026-03-30 Created: 2026-03-30 Last updated: 2026-03-30Bibliographically approved

Open Access in DiVA

No full text in DiVA

Scopus

Authority records

Gouverneur, AmauryOechtering, Tobias J.Skoglund, Mikael

Search in DiVA

By author/editor
Gouverneur, AmauryOechtering, Tobias J.Skoglund, Mikael
By organisation
School of Electrical Engineering and Computer Science (EECS)Information Science and Engineering
In the same journal
Transactions on Machine Learning Research
Computer SciencesProbability Theory and StatisticsMathematical Analysis

Search outside of DiVA

GoogleGoogle Scholar

urn-nbn

Altmetric score

urn-nbn
Total: 19 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf