A Value Iteration Algorithm for Partially Observed Markov Decision Process Multi-armed Bandits
2004 (English). Conference paper (Refereed)
A value iteration based algorithm is given for computing the Gittins index of a Partially Observed Markov Decision Process (POMDP) Multi-armed Bandit problem. This problem concerns the dynamic allocation of effort between a number of competing projects, of which only one can be worked on in any time period. The active project evolves according to a finite state Markov chain and then generates a reward, while the states of the idle projects remain fixed. In this contribution, it is assumed that the state of the active project can only be observed indirectly through noisy observations. The objective is to find the optimal policy, based on partial information, for determining which project to work on at each time in order to maximize the total expected reward. The solution is obtained by transforming the problem into a standard POMDP problem, for which efficient near-optimal algorithms exist. A numerical example from the field of task planning for an autonomous robot is presented to illustrate the algorithm.
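The abstract's core object, the Gittins index, can be computed by value iteration in the simpler fully observed case via the Katehakis–Veinott "restart-in-state" formulation: for each state i, solve an auxiliary MDP where one may either continue the project or restart it in state i, and scale the resulting value by (1 − β). The sketch below illustrates this standard construction only; it is not the paper's POMDP algorithm (which works on belief states), and all names and parameters are our own illustrative choices.

```python
import numpy as np

def gittins_indices(P, r, beta, tol=1e-10, max_iter=10_000):
    """Gittins indices of a fully observed Markov reward process,
    via value iteration on the restart-in-state-i problem
    (Katehakis-Veinott). Simplified sketch, not the paper's POMDP method.

    P    : (n, n) transition matrix of the project
    r    : (n,) reward vector
    beta : discount factor in (0, 1)
    """
    n = len(r)
    indices = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(max_iter):
            cont = r + beta * (P @ V)       # value of continuing in each state
            V_new = np.maximum(cont, cont[i])  # option to restart in state i
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        # Gittins index of state i is (1 - beta) times the restart value
        indices[i] = (1 - beta) * V[i]
    return indices
```

For example, if every state is absorbing (P = I), each state's Gittins index reduces to its immediate reward, which gives a quick sanity check on the iteration.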
Identifiers: URN: urn:nbn:se:kth:diva-57927; OAI: oai:DiVA.org:kth-57927; DiVA: diva2:472700
Mathematical Theory of Networks and Systems (MTNS), Leuven, Belgium
QC 20120301. Available from: 2012-01-04; Created: 2012-01-04; Last updated: 2013-09-05. Bibliographically approved