Graph Bandits: Multi-Armed Bandits with Locality Constraints
KTH, School of Electrical Engineering and Computer Science (EECS).
2022 (English)
Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits
Student thesis
Alternative title
Grafbanditer: Flerarmade banditer med lokala restriktioner (Swedish)
Abstract [en]

Multi-armed bandits (MABs) have been studied extensively in the literature and have applications in a wealth of domains, including recommendation systems, dynamic pricing, and investment management. On the one hand, the current MAB literature largely focuses on the setting where every arm is available to play at each time step, and ignores how agents move between arms. On the other hand, there is work that takes the movement between arms into account, but this work models the problem as a Markov decision process and applies generic reinforcement learning (RL) algorithms, such as Q-learning. This thesis examines an extension of the MAB problem to a setting where the set of available arms at each round depends on which arm was played in the previous round. In this formulation the arms are nodes in a graph, and arms that can be played successively are connected by edges. We call this the Graph Bandit (GB) problem. We show that under certain conditions the optimal action is governed by a stationary policy. Furthermore, we develop an algorithm that leverages the graphical structure of the problem to find this policy when the reward distributions are perfectly known; we call this algorithm the Q-graph. When the reward distributions are unknown, we show how to combine the Q-graph algorithm with standard sampling algorithms, such as Thompson sampling and upper confidence bound (UCB), to create an online learning algorithm that provably achieves logarithmic regret. Finally, this regret bound is supported by numerical simulations, which illustrate how the proposed Q-graph algorithm outperforms generic algorithms from the MAB and RL communities.
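The locality constraint described in the abstract can be made concrete with a small simulation. The following is a minimal illustrative sketch only, not the thesis's Q-graph algorithm: the four-node graph, the Bernoulli reward means, and the use of a plain UCB1 index over the currently reachable neighbours are all assumptions made for this example.

```python
import math
import random

# Toy graph bandit: at each round the learner may only move to a neighbour
# of the arm it played last round. A standard UCB1 index is used to choose
# among the currently available arms. Graph and means are made-up values.

graph = {                # adjacency list: arm -> arms playable next round
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1],
    3: [1],
}
true_means = {0: 0.2, 1: 0.5, 2: 0.4, 3: 0.9}   # hypothetical Bernoulli means

counts = {a: 0 for a in graph}   # times each arm was played
sums = {a: 0.0 for a in graph}   # cumulative reward per arm

def ucb_index(arm, t):
    """UCB1 index; unplayed arms get +inf so they are tried first."""
    if counts[arm] == 0:
        return float("inf")
    mean = sums[arm] / counts[arm]
    return mean + math.sqrt(2 * math.log(t) / counts[arm])

random.seed(0)
current = 0
horizon = 5000
for t in range(1, horizon + 1):
    # Locality constraint: only neighbours of the current arm are available.
    available = graph[current]
    current = max(available, key=lambda a: ucb_index(a, t))
    reward = 1.0 if random.random() < true_means[current] else 0.0
    counts[current] += 1
    sums[current] += reward

# The best arm (node 3, mean 0.9) is only reachable from node 1, so good
# behaviour is a cycle 1 -> 3 -> 1 -> ..., a stationary policy, rather than
# repeated play of one arm.
print(counts)
```

Note how the constraint changes the nature of the solution: since node 3 can only be entered from (and exited to) node 1, the learner's long-run behaviour is a cycle through the graph rather than a single best arm, which is exactly the distinction the thesis draws between the GB problem and the classical MAB setting.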

Abstract [sv]

Flerarmade banditer (FAB) har studerats omfattande i litteraturen och har applikationer inom en mängd domäner, såsom rekommendationssystem, dynamisk prissättning och finans. Å ena sidan verkar det som att en stor del av litteraturen fokuserar på situationen där alla armar är tillgängliga att spela vid varje tidssteg och ignorerar hur agenten rör sig mellan armarna. Å andra sidan finns det arbete som tar hänsyn till hur agenten rör sig mellan armarna, men det arbetet modellerar systemet som en Markovprocess och använder sig av generiska inlärningsmetoder, såsom Q-learning. Den här uppsatsen undersöker en utvidgning av FAB-problemet till en situation där mängden tillgängliga armar vid varje runda beror på vilken arm som spelades i den föregående rundan. I denna formulering är armarna noder i en graf och armar som kan spelas i på varandra följande rundor är anslutna via kanter. Vi kallar det här problemet Grafbanditen. Vi visar att under vissa förutsättningar bestäms det optimala agerandet av en stationär policy. Vi utvecklar också en algoritm som utnyttjar den grafiska strukturen i problemet för att beräkna denna policy när distributionerna hos alla armar är kända. Denna algoritm får namnet Q-grafen. När distributionerna är okända visar vi hur Q-grafen kan användas tillsammans med Thompson sampling eller upper confidence bound-metoder för att skapa en onlineinlärningsalgoritm som bevisligen uppnår logaritmisk regret. Slutligen stöds de teoretiska resultaten av numeriska simuleringar som illustrerar att Q-grafen är överlägsen många generiska inlärningsalgoritmer.

Place, publisher, year, edition, pages
2022, p. 59
Series
TRITA-EECS-EX ; 2022:525
Keywords [en]
Multi-armed bandits, locality constraints, reinforcement learning
Keywords [sv]
Flerarmade banditer, lokala restriktioner, förstärkningsinlärning
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-320733
OAI: oai:DiVA.org:kth-320733
DiVA id: diva2:1707379
External cooperation
Harvard University
Available from: 2022-11-02 · Created: 2022-10-31 · Last updated: 2022-11-02 · Bibliographically approved

Open Access in DiVA

fulltext (2359 kB), 346 downloads
File information
File name: FULLTEXT01.pdf
File size: 2359 kB
Checksum (SHA-512): b5376fce6178298aa5f11c6c1d7fc37cb10201eddb98e9b0dfddb2e9af719ed11d75849c747405d668fea425378e31684dc800704fd49102ce93afdb6b12a9ff
Type: fulltext
Mimetype: application/pdf

By organisation
School of Electrical Engineering and Computer Science (EECS)
Computer and Information Sciences

Search outside of DiVA

Google · Google Scholar
Total: 346 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 485 hits