Partially Observed Markov Decision Process Multiarmed Bandits: Structural Results
2009 (English). In: Mathematics of Operations Research, ISSN 0364-765X, E-ISSN 1526-5471, Vol. 34, No. 2, pp. 287-302. Article in journal (Refereed). Published.
This paper considers multiarmed bandit problems involving partially observed Markov decision processes (POMDPs). We show how the Gittins index for the optimal scheduling policy can be computed by a value iteration algorithm run on each process separately, thereby considerably reducing the computational cost. A suboptimal value iteration algorithm based on Lovejoy's approximation is presented. We then show that when the transition probability matrices are totally positive of order 2 (TP2) and the observation probabilities are monotone likelihood ratio (MLR) ordered, the Gittins index is MLR increasing in the information state. Algorithms that exploit this structure are then presented.
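As background to the abstract's claim that the Gittins index can be computed by value iteration on each process separately, the sketch below illustrates the idea for the simpler fully observed case, using the well-known "restart-in-state" formulation of the Gittins index (due to Katehakis and Veinott). This is an illustrative assumption on my part, not the paper's POMDP algorithm: the function name, the fixed iteration count, and the fully observed setting are choices made here for clarity.

```python
import numpy as np

def gittins_index_restart(P, r, beta, i, n_iter=2000):
    """Sketch: Gittins index of state i for a single fully observed arm.

    Uses value iteration on the 'restart-in-i' MDP: in every state the
    controller either continues the arm or restarts it in state i.
    The Gittins index of i is (1 - beta) times the value at i.
    P    : (n, n) transition probability matrix of the arm
    r    : (n,) reward vector
    beta : discount factor in (0, 1)
    """
    V = np.zeros(len(r))
    for _ in range(n_iter):
        cont = r + beta * (P @ V)            # continue from each state
        restart = r[i] + beta * (P[i] @ V)   # or restart the arm in state i
        V = np.maximum(cont, restart)        # Bellman update of restart MDP
    return (1.0 - beta) * V[i]

# Example: a 2-state arm; the index of the higher-reward, "sticky" state
# should dominate that of the lower-reward state.
P = np.array([[0.9, 0.1], [0.1, 0.9]])
r = np.array([1.0, 2.0])
print(gittins_index_restart(P, r, 0.9, 0), gittins_index_restart(P, r, 0.9, 1))
```

In the POMDP setting treated by the paper, the state of each arm is replaced by an information state (a belief vector), and the analogous value iteration runs over the belief space, which is where Lovejoy's approximation and the MLR structural results become important.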
Keywords: multiarmed bandits, partially observed Markov decision process, monotone policies, likelihood ratio ordering, opportunistic scheduling, stochastic approximation algorithm
Identifiers: URN: urn:nbn:se:kth:diva-18430; DOI: 10.1287/moor.1080.0371; ISI: 000266101500004; Scopus ID: 2-s2.0-67649963279; OAI: oai:DiVA.org:kth-18430; DiVA: diva2:336477
Funder: Swedish Research Council