Aligning Human Preferences with Baseline Objectives in Reinforcement Learning
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-3510-5481
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-7461-920X
KTH, School of Electrical Engineering and Computer Science (EECS), Centres, Centre for Autonomous Systems, CAS. KTH, School of Electrical Engineering and Computer Science (EECS), Centres, ACCESS Linnaeus Centre. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-4173-2593
2023 (English). In: 2023 IEEE International Conference on Robotics and Automation (ICRA 2023), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed).
Abstract [en]

Practical implementations of deep reinforcement learning (deep RL) have been challenging due to a multitude of factors, such as designing reward functions that cover every possible interaction. To address the heavy burden of robot reward engineering, we aim to leverage subjective human preferences gathered in the context of human-robot interaction, while taking advantage of a baseline reward function when one is available. By assuming that baseline objectives are designed beforehand, we are able to narrow down the policy space, requesting human attention only when their input matters most. To allow control over the optimization of different objectives, our approach adopts a multi-objective setting. We achieve human-compliant policies by sequentially training an optimal policy from a baseline specification and collecting queries on pairs of trajectories. These policies are obtained by training a reward estimator to generate Pareto-optimal policies that include human-preferred behaviours. Our approach ensures sample efficiency, and we conducted a user study to collect real human preferences, which we used to obtain a policy in a social navigation environment.
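
The abstract describes combining a hand-designed baseline objective with a reward estimator learned from pairwise trajectory preferences. The sketch below is only a minimal illustration of those two ingredients under common RLHF assumptions (a Bradley-Terry preference model and a fixed scalarisation weight w), not the authors' implementation; all class names, network sizes, and the weight vector are assumed for the example.

```python
# Illustrative sketch only (not the paper's code): a reward estimator trained on
# pairwise trajectory preferences, combined with a fixed baseline reward.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceReward(nn.Module):
    """Learned per-step reward over concatenated (observation, action) features."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def bradley_terry_loss(r_hat: PreferenceReward, traj_a: dict, traj_b: dict,
                       prefs: torch.Tensor) -> torch.Tensor:
    """prefs[i] = 1.0 if segment A was preferred in query i, else 0.0.
    traj_* hold "obs" of shape (batch, T, obs_dim) and "act" of shape (batch, T, act_dim)."""
    return_a = r_hat(traj_a["obs"], traj_a["act"]).sum(dim=-1)  # predicted segment return
    return_b = r_hat(traj_b["obs"], traj_b["act"]).sum(dim=-1)
    return F.binary_cross_entropy_with_logits(return_a - return_b, prefs)

def combined_reward(r_hat, baseline_reward_fn, obs, act, w=(0.5, 0.5)):
    """Scalarised multi-objective reward: a fixed baseline objective plus the
    learned preference objective, weighted by an (assumed) preference vector w."""
    return w[0] * baseline_reward_fn(obs, act) + w[1] * r_hat(obs, act)
```

In a setup like this, the policy is first trained against the baseline term alone, and the learned term only shifts behaviour where human queries indicate a preference; varying w traces out different trade-offs between the two objectives, which is one way to think about the Pareto-optimal policies mentioned in the abstract.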

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023.
Series
IEEE International Conference on Robotics and Automation (ICRA)
National Category
Robotics and automation
Identifiers
URN: urn:nbn:se:kth:diva-324924
DOI: 10.1109/ICRA48891.2023.10161261
ISI: 001048371100079
Scopus ID: 2-s2.0-85164820716
OAI: oai:DiVA.org:kth-324924
DiVA id: diva2:1744884
Conference
IEEE International Conference on Robotics and Automation (ICRA), May 29 – June 2, 2023, London, England
Note

Part of ISBN 979-8-3503-2365-8

QC 20230328

Available from: 2023-03-21. Created: 2023-03-21. Last updated: 2025-05-19. Bibliographically approved.
In thesis
1. Towards safe, aligned, and efficient reinforcement learning from human feedback
2025 (English). Doctoral thesis, comprehensive summary (Other academic).
Abstract [en]

Reinforcement learning policies are becoming increasingly prevalent in robotics and AI-human interactions due to their effectiveness in tackling complex and challenging domains. Many of these policies, also referred to as AI agents, are trained using human feedback through techniques collectively known as Reinforcement Learning from Human Feedback (RLHF). This thesis addresses three key challenges (safety, alignment, and efficiency) that arise when deploying these policies in real-world applications involving actual human users. To this end, it proposes several novel methods. Ensuring the safety of human-robot interaction is a fundamental requirement for deployment. While most prior research has explored safety within discrete state and action spaces, we investigate novel approaches for synthesizing safety shields from human feedback, enabling safer policy execution in various challenging settings, including continuous state and action spaces, such as social navigation. To better align policies with human feedback, contemporary works predominantly rely on single-reward settings. However, we argue for the necessity of a multi-objective paradigm, as most human goals cannot be captured by a single-valued reward function. Moreover, most robotic tasks have predefined baseline goals related to task success, such as reaching a navigation waypoint. Accordingly, we first introduce a method to align policies with multiple objectives using pairwise preferences. Additionally, we propose a novel multi-modal approach that leverages zero-shot reasoning with large language models alongside pairwise preferences to adapt multi-objective goals for these policies. The final challenge addressed in this thesis is improving the sample efficiency and reusability of these policies, which is crucial when adapting policies based on real human feedback. Since requesting human feedback is both costly and burdensome, potentially degrading the quality of human-agent interactions, we propose two distinct methods to mitigate these issues. First, to enhance the efficiency of RLHF, we introduce an active learning method that combines unsupervised learning techniques with uncertainty estimation to prioritize the most informative queries for human feedback. Second, to improve the reusability of reward functions derived from human feedback and reduce the need for redundant queries in similar tasks, we investigate low-rank adaptation techniques for adapting pre-trained reward functions to new tasks.
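
The summary above mentions an active learning method that uses uncertainty estimation to prioritise informative preference queries. As a rough sketch of the uncertainty side only, the code below ranks candidate trajectory pairs by the disagreement of an ensemble of reward models; the Bradley-Terry preference probability, the variance-based disagreement score, and all function names are assumptions made for illustration rather than details taken from the thesis.

```python
# Illustrative sketch (not the thesis implementation): pick preference queries
# where an ensemble of reward models disagrees the most.
import numpy as np

def preference_probability(reward_model, traj_a, traj_b) -> float:
    """P(A preferred over B) under one reward model, assuming a Bradley-Terry model.
    Trajectories are lists of (state, action) pairs; reward_model(s, a) -> float."""
    return_a = sum(reward_model(s, a) for s, a in traj_a)
    return_b = sum(reward_model(s, a) for s, a in traj_b)
    return 1.0 / (1.0 + np.exp(return_b - return_a))

def select_queries(ensemble, candidate_pairs, k: int):
    """Return the k candidate pairs with the highest ensemble disagreement,
    measured here as the variance of the predicted preference probability."""
    disagreement = []
    for traj_a, traj_b in candidate_pairs:
        probs = [preference_probability(m, traj_a, traj_b) for m in ensemble]
        disagreement.append(np.var(probs))
    top = np.argsort(disagreement)[::-1][:k]
    return [candidate_pairs[i] for i in top]
```

Only the selected pairs would then be shown to a human annotator, which is the main lever for reducing the number of queries; the unsupervised component mentioned in the summary (for example, clustering of candidate trajectories) would prune the candidate set before this step and is omitted here.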

Abstract [sv]

Reinforcement learning policies are becoming increasingly common in robotics and AI-human interaction thanks to their effectiveness in handling complex and challenging domains. Many of these policies, also called AI agents, are trained with human feedback through techniques collectively referred to as Reinforcement Learning from Human Feedback (RLHF). This thesis addresses three central challenges (safety, alignment, and efficiency) that arise when deploying these policies in real-world applications involving actual human users. Several new methods are proposed to this end. Ensuring safety in human-robot interaction is a fundamental prerequisite for deployment. While earlier research has mainly investigated safety in discrete state and action spaces, we investigate new methods for synthesizing safety shields from human feedback, enabling safer policy execution in a range of challenging settings, including continuous state and action spaces such as social navigation. To better align policies with human feedback, modern work relies mainly on single-reward settings. We argue, however, for the need for a multi-objective paradigm, since most human goals cannot be captured by a single-valued reward function. Moreover, most robotic tasks have predefined baseline goals tied to task success, such as reaching a navigation waypoint. Accordingly, we first introduce a method for aligning policies with multiple objectives through pairwise preferences. We also propose a new multi-modal method that leverages zero-shot reasoning with large language models together with pairwise preferences to adapt multi-objective goals for these policies. The final challenge addressed in this thesis is improving the sample efficiency and reusability of these policies, which is crucial when adapting policies based on real human feedback. Since collecting human feedback is both costly and burdensome, and can degrade the quality of human-agent interactions, we propose two different methods to mitigate these problems. First, we introduce an active learning method for improving the efficiency of RLHF by combining unsupervised learning techniques with uncertainty estimation to prioritize the most informative requests for human feedback. Second, we investigate low-rank adaptation techniques for adapting pre-trained reward functions to new tasks, which improves the reusability of reward functions derived from human feedback and reduces the need for redundant requests in similar tasks.
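
The final point above concerns low-rank adaptation of pre-trained reward functions. The sketch below illustrates the general LoRA-style technique (a frozen base network plus trainable low-rank factors on each linear layer) rather than the thesis implementation; the rank, scaling, initialisation, and helper names are assumptions chosen for the example.

```python
# Illustrative LoRA-style sketch (not the thesis code): adapt a frozen,
# pre-trained reward network to a new task by training only low-rank factors.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pre-trained weights stay frozen
        self.A = nn.Parameter(0.01 * torch.randn(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

def add_lora(module: nn.Module, rank: int = 4) -> nn.Module:
    """Recursively replace every nn.Linear in a pre-trained reward model with a
    LoRA wrapper; only the A/B factors are then trained on the new task's preferences."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank))
        else:
            add_lora(child, rank)
    return module
```

Because only the low-rank factors receive gradients, comparatively few preference queries on the new task are needed to adjust the reward function, which is the reusability argument sketched above.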

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025. p. xi, 77
Series
TRITA-EECS-AVL ; 2025:49
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-363515
ISBN: 978-91-8106-275-5
Public defence
2025-06-05, Q2, Malvinas väg 10, Stockholm, 15:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20250519

Available from: 2025-05-20. Created: 2025-05-19. Last updated: 2025-06-30. Bibliographically approved.

Open Access in DiVA

fulltext (5631 kB), 910 downloads
File information
File name: FULLTEXT01.pdf
File size: 5631 kB
Checksum (SHA-512): e28f48477805b89162652c0fe7dcb2beedc5dcf9a87f8ca90a9ad6b880eb529c8a00a93b24634d3a772cb9a3d5c1cc9678bc21c8292e1bb29509df591bbfccf6
Type: fulltext
Mimetype: application/pdf

Other links

Publisher's full text, Scopus

Authority records

Marta, Daniel; Holk, Simon; Pek, Christian; Tumova, Jana; Leite, Iolanda
