We study the problem of regret minimization in Multi-Agent Multi-Armed Bandits (MAMABs), where the rewards are defined through a factor graph. We derive an instance-specific regret lower bound and characterize the minimal expected number of times each global action should be explored. This bound and the corresponding optimal exploration process are obtained by solving a combinatorial optimization problem whose numbers of variables and constraints grow exponentially with the number of agents, and which therefore cannot be exploited directly in the design of efficient algorithms. Inspired by Mean Field approximation techniques used in graphical models, we provide simple upper bounds on the regret lower bound. The corresponding optimization problems have reduced numbers of variables and constraints. By tuning the latter, we can explore the trade-off between the achievable regret and the complexity of computing the corresponding exploration process. We devise Efficient Sampling for MAMAB (ESM), an algorithm whose regret asymptotically matches the approximated lower bounds. The regret and computational complexity of ESM are assessed numerically, using both synthetic and real-world experiments in radio communication networks.
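For concreteness, the following is a minimal sketch of the factor-graph reward structure commonly assumed in MAMABs; the notation (\(\mathcal{F}\), \(\mu_f\), \(a_f\), \(\eta\), \(C(\theta)\)) is illustrative and not taken from the paper itself. The expected reward of a global action \(a = (a_1, \dots, a_m)\), collecting one arm per agent, decomposes over the factors \(\mathcal{F}\) of the graph,
\[
  \mu(a) \;=\; \sum_{f \in \mathcal{F}} \mu_f(a_f),
\]
where \(a_f\) denotes the restriction of \(a\) to the small subset of agents involved in factor \(f\). Under such a model, instance-specific lower bounds are typically of the Graves-Lai type, \(\liminf_{T \to \infty} R(T)/\log T \ge C(\theta)\), where \(C(\theta)\) is the value of an optimization problem over exploration rates \(\eta(a) \ge 0\) with one variable per global action; this is consistent with the exponential growth in the number of variables and constraints mentioned above.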