Changing the Random Behavior of a Q-Learning Agent over Time.
KTH, School of Computer Science and Communication (CSC).
2011 (Swedish). Independent thesis, Advanced level (professional degree), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

Q-learning is a reinforcement learning technique in which an AI agent learns from experience. The technique is commonly used together with the so-called epsilon-greedy policy. The goal of this thesis was to determine how a few different random-behavior policies affect the learning rate of a Q-learning agent. To test this, our agent played a reduced instance of the board game Blokus on a 5 by 5 board, primarily against a randomly playing opponent. In these tests, two policies were evaluated, both of which started with the agent preferring random moves and gradually shifted toward trusting its previous experience. Both new policies converged to a win rate close to 100%. The study proved inconclusive, however, as the game instance was very limited and, with our implementation, the agent could beat it without any random behavior at all. Their performance, similar to that of a relatively non-exploring strategy, together with the theoretical motivation for their use, indicates that further research on them is warranted.

Abstract [sv]

Q-learning is a reward-based learning technique in which an AI agent learns from experience. The technique commonly appears together with a policy called epsilon-greedy. The goal of this work was to determine how different policies affected the learning rate of the Q-learning-based agent. To test the agent, a reduced instance of the board game Blokus was played on a 5 by 5 board, primarily against an opponent that placed its pieces completely at random. During the tests, two different policies were examined, both of which started with the agent preferring random moves and then gradually shifted to relying more and more on its previous experience. Both new policies made the agent's behavior converge to a win rate of nearly 100%. The study proved inconclusive, however, because the game instance was very limited and the agent in our implementation could handle the instance without any random behavior at all. The results did show that both new policies performed similarly on this instance. Together with theoretical arguments for their use, this indicates that further research in the area is warranted.
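The policies described in the abstracts amount to annealing the exploration rate of an epsilon-greedy Q-learning agent over the course of training: early episodes favor random moves, later episodes favor the learned Q-values. The record contains no code, but the following minimal Python sketch illustrates the general idea; the environment interface (reset, step, legal_actions), the linear decay schedule, and all parameter values are assumptions chosen for illustration and are not taken from the thesis.

    import random
    from collections import defaultdict

    # Minimal tabular Q-learning sketch with a decaying epsilon-greedy policy.
    # The environment interface and the decay schedule are illustrative
    # assumptions, not the implementation used in the thesis.

    def epsilon_for_episode(episode, num_episodes, eps_start=1.0, eps_end=0.05):
        """Linearly anneal epsilon from eps_start to eps_end over training."""
        frac = min(1.0, episode / max(1, num_episodes - 1))
        return eps_start + frac * (eps_end - eps_start)

    def choose_action(q, state, actions, epsilon):
        """Epsilon-greedy: random move with probability epsilon, otherwise greedy."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q[(state, a)])

    def train(env, num_episodes=10_000, alpha=0.1, gamma=0.99):
        q = defaultdict(float)  # Q-values keyed by (state, action)
        for episode in range(num_episodes):
            epsilon = epsilon_for_episode(episode, num_episodes)
            state, done = env.reset(), False
            while not done:
                actions = env.legal_actions(state)
                action = choose_action(q, state, actions, epsilon)
                next_state, reward, done = env.step(action)
                # Standard one-step Q-learning update toward the greedy target.
                next_actions = env.legal_actions(next_state)
                best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
                target = reward + (0.0 if done else gamma * best_next)
                q[(state, action)] += alpha * (target - q[(state, action)])
                state = next_state
        return q

In the setting of the thesis, the states and actions would correspond to positions and legal piece placements in the reduced 5 by 5 Blokus game.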

Place, publisher, year, edition, pages
2011.
Series
Kandidatexjobb CSC, K11041
National Category
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-130810
OAI: oai:DiVA.org:kth-130810
DiVA: diva2:654257
Educational program
Master of Science in Engineering - Computer Science and Technology
Uppsok
Technology
Available from: 2013-10-07 Created: 2013-10-07

Open Access in DiVA

No full text

Other links

http://www.csc.kth.se/utbildning/kandidatexjobb/datateknik/2011/rapport/bostrom_peter_OCH_modee_anna_maria_K11041.pdf
By organisation
School of Computer Science and Communication (CSC)
Computer Science
