We study the performance of the Thompson Sampling algorithm for logistic bandit problems. In this setting, an agent receives binary rewards with probabilities determined by a logistic function, exp(β[removed])/(1+ exp(β[removed])), with parameter β > 0, and both the action a ∈ A and the unknown parameter θ ∈ O lie within the d-dimensional unit ball. Adopting the information-theoretic framework introduced by Russo & Van Roy (2016), we derive regret bounds via the analysis of the information ratio, a statistic that quantifies the trade-off between the immediate regret incurred by the agent and the information it just gained about the parameter θ. We improve upon previous results and establish that the information ratio is bounded by d(4/α)2, where d is the dimension of the problem and α is a minimax measure of the alignment between the action space A and the parameter space O. Notably, our bound does not scale exponentially with the logistic slope and is independent of the cardinality of the action and parameter spaces. Using this result, we derive a bound on the Thompson Sampling expected regret of order (Formula Presented) where T is the number of time steps. To our knowledge, this is the first regret bound for any logistic bandit algorithm that avoids any exponential scaling with β and is independent of the number of actions. In particular, when the parameters are on the sphere and the action space contains the parameter space, the expected regret bound is of order (Formula Presented).
Not duplicate with DiVA 2015716
QC 20260330