SEQUEL: Semi-Supervised Preference-based RL with Query Synthesis via Latent Interpolation
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. Digital Futures. ORCID iD: 0000-0002-3510-5481
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. Digital Futures. ORCID iD: 0000-0001-5727-8140
Dept. of Cognitive Robotics, TU Delft, Delft, The Netherlands. ORCID iD: 0000-0001-7461-920X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. Digital Futures. ORCID iD: 0000-0002-2212-4325
2024 (English). In: 2024 IEEE International Conference on Robotics and Automation (ICRA), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 9585-9592. Conference paper, Published paper (Refereed)
Abstract [en]

Preference-based reinforcement learning (RL) is a recent research direction in robot learning that lets humans teach robots by expressing preferences over pairs of behaviours. However, to obtain realistic robot policies, humans must answer an arbitrarily large number of queries. In this work, we address this sample-efficiency challenge with a technique that synthesizes queries from a semi-supervised learning perspective. To achieve this, we leverage variational autoencoder (VAE) latent representations of trajectory segments (sequences of state-action pairs). Our approach produces queries that are closely aligned with those labeled by humans, while avoiding excessive uncertainty in the human preference predictions derived from reward estimates. Additionally, by introducing variation without deviating from the original human's intent, it achieves more robust reward function representations. We compare our approach to recent state-of-the-art semi-supervised learning techniques for preference-based RL. Our experimental findings show that we can improve the generalization of the estimated reward function without requiring additional human intervention. Lastly, to confirm the practical applicability of our approach, we conduct experiments with human users in a simulated social navigation setting. Videos of the experiments can be found at https://sites.google.com/view/rl-sequel
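The idea of synthesizing queries by interpolating in a latent space can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names (`lerp`, `synthesize_queries`), the `predicted_return` callable (standing in for "decode the latent and sum the estimated reward"), the Bradley-Terry-style logistic preference model, and the `alphas` and `min_confidence` values are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def lerp(z_a: Vector, z_b: Vector, alpha: float) -> Vector:
    """Linear interpolation: alpha=0 returns z_a, alpha=1 returns z_b."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def preference_probability(return_a: float, return_b: float) -> float:
    """Logistic (Bradley-Terry-style) probability that A is preferred over B."""
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def synthesize_queries(
    z_preferred: Vector,
    z_rejected: Vector,
    predicted_return: Callable[[Vector], float],  # hypothetical reward-model stand-in
    alphas: Sequence[float] = (0.1, 0.4),         # hypothetical step sizes
    min_confidence: float = 0.7,                  # hypothetical uncertainty threshold
) -> List[Tuple[Vector, Vector, float]]:
    """Move both latents a step toward each other and keep the synthetic pair
    only if the reward model still predicts the human's label confidently."""
    kept = []
    for alpha in alphas:
        z_pref_new = lerp(z_preferred, z_rejected, alpha)
        z_rej_new = lerp(z_rejected, z_preferred, alpha)
        p = preference_probability(
            predicted_return(z_pref_new), predicted_return(z_rej_new)
        )
        if p >= min_confidence:
            kept.append((z_pref_new, z_rej_new, p))
    return kept


# Toy usage: 2-D latents and a stand-in "return" equal to the first coordinate.
kept = synthesize_queries([2.0, 0.0], [0.0, 1.0], lambda z: z[0])
print(len(kept), "synthetic pair(s) kept")
```

In this toy run the small interpolation step keeps the pair clearly ordered, while the larger step pushes the two latents close enough together that the predicted preference becomes too uncertain and the pair is discarded.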

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024. p. 9585-9592
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-360978
DOI: 10.1109/ICRA57147.2024.10610534
ISI: 001369728000064
Scopus ID: 2-s2.0-85199009127
OAI: oai:DiVA.org:kth-360978
DiVA, id: diva2:1942922
Conference
IEEE International Conference on Robotics and Automation (ICRA)
Note

Part of ISBN 9798350384574

QC 20250310

Available from: 2025-03-07 Created: 2025-03-07 Last updated: 2025-03-10 Bibliographically approved
In thesis
1. Improving Sample-efficiency of Reinforcement Learning from Human Feedback
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

With the rapid advancement of AI, the technology has moved out of industrial and lab settings and into the hands of everyday people. Once AI and robot agents are placed in everyday households, they need to be able to take human needs into account. With methods like Reinforcement Learning from Human Feedback (RLHF), the agent can learn desirable behavior by learning a reward function or by optimizing a policy directly from human feedback. Unlike vision models and large language models (LLMs), which benefit from internet-scale data, RLHF is limited by the amount of feedback provided, since it requires additional human effort. In this thesis, we look into how to decrease the amount of feedback humans must provide when estimating a reward function, reducing their burden without degrading the estimate. We investigate the fundamental trade-off between the informativeness and efficiency of feedback from a preference-based learning perspective. In this regard, we introduce multiple methods that fall into two groups: implicit methods, which increase the quality of the feedback without additional human effort, and explicit methods, which aim to drastically increase the information content by using additional feedback types. To implicitly improve the efficiency of preference feedback, we look into how Active Learning (AL) can improve the diversity of samples by strategically picking from different clusters in a representation learned by a Variational Autoencoder (VAE). Furthermore, we make use of the unique relationship between preference pairs to perform data synthesis by interpolation in the latent space of the VAE. While the implicit methods have the benefit of requiring no extra effort, they still suffer from the limited amount of information that preferences alone can provide.
One limitation of preferences on trajectories is that there is no discounting, which means that if a trajectory is preferred, the whole trajectory is assumed to be preferred, leading to causal confusion. Therefore, we introduce a new form of feedback called highlights, which lets the user mark on the trajectory which parts were good and which were bad. Furthermore, leveraging LLMs, we create a method that lets humans explain their preferences in natural language, from which we deduce which parts were preferred. Overall, this thesis takes a step away from the assumption of internet-scale data and shows how we can achieve alignment from less human feedback.
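The limitation of trajectory-level preferences can be illustrated with a small sketch. It assumes one common formulation in preference-based RL, a Bradley-Terry model over undiscounted reward sums; the thesis record does not state which model is used, and the function name `segment_preference` and the reward values are hypothetical. Under this assumption the label depends only on the sum of rewards along each segment, so it cannot tell which part of a segment was good and which was bad.

```python
import math
from typing import Sequence


def segment_preference(rewards_a: Sequence[float], rewards_b: Sequence[float]) -> float:
    """P(A preferred over B) under a Bradley-Terry model on undiscounted sums."""
    return 1.0 / (1.0 + math.exp(sum(rewards_b) - sum(rewards_a)))


# Two segments with the same per-step rewards in a different order (hypothetical values).
good_then_bad = [1.0, 1.0, -0.5]
bad_then_good = [-0.5, 1.0, 1.0]
baseline = [0.2, 0.2, 0.2]

p_first = segment_preference(good_then_bad, baseline)
p_second = segment_preference(bad_then_good, baseline)

# Both comparisons give the same probability: the preference label carries
# no information about where in the segment the bad step occurred.
print(p_first == p_second)
```

Feedback types that point at parts of a trajectory, such as the highlights described above, supply exactly the per-step information that this segment-level signal lacks.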

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025. p. ix, 64
Series
TRITA-EECS-AVL ; 2025:31
Keywords
RLHF, Reinforcement Learning from Human Feedback, Reinforcement Learning, Machine Learning
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-360983
ISBN: 978-91-8106-221-2
Public defence
2025-04-01, https://kth-se.zoom.us/j/62755931085, F3 (Flodis), Lindstedtsvägen 26, Stockholm, 14:00 (English)
Note

QC 20250307

Available from: 2025-03-07 Created: 2025-03-07 Last updated: 2025-03-17 Bibliographically approved

Authority records

Marta, Daniel; Holk, Simon; Pek, Christian; Leite, Iolanda