Publications (10 of 11)
Marta, D., Holk, S., Vasco, M., Lundell, J., Homberger, T., Busch, F. L., . . . Leite, I. (2025). FLoRA: Sample-Efficient Preference-based RL via Low-Rank Style Adaptation of Reward Functions. Paper presented at IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 19-23 May 2025. Institute of Electrical and Electronics Engineers (IEEE)
2025 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Preference-based reinforcement learning (PbRL) is a suitable approach for style adaptation of pre-trained robotic behavior: adapting the robot's policy to follow human user preferences while still being able to perform the original task. However, collecting preferences for the adaptation process in robotics is often challenging and time-consuming. In this work we explore the adaptation of pre-trained robots in the low-preference-data regime. We show that, in this regime, recent adaptation approaches suffer from catastrophic reward forgetting (CRF), where the updated reward model overfits to the new preferences, leading the agent to become unable to perform the original task. To mitigate CRF, we propose to enhance the original reward model with a small number of parameters (low-rank matrices) responsible for modeling the preference adaptation. Our evaluation shows that our method can efficiently and effectively adjust robotic behavior to human preferences across simulation benchmark tasks and multiple real-world robotic tasks. We provide videos of our results and source code at https://sites.google.com/view/preflora/
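To make the low-rank idea concrete, here is a minimal, hypothetical sketch (not the authors' released code) of augmenting a frozen pre-trained reward network with trainable rank-r matrices, so that only the small adapter is fitted to the new preferences; layer sizes, the rank, and all names are illustrative assumptions.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # wraps a frozen pre-trained linear layer with a trainable low-rank update
        def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():      # the original reward weights stay fixed
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))
            self.scale = alpha / rank

        def forward(self, x):
            # frozen path plus low-rank update B @ A that models the style adaptation
            return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

    # wrap the layers of a (toy) pre-trained reward model r(s, a)
    reward_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
    reward_net[0] = LoRALinear(reward_net[0])
    reward_net[2] = LoRALinear(reward_net[2])
    adapter_params = [p for p in reward_net.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(adapter_params, lr=1e-3)  # only A and B are updated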

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-360980 (URN)
Conference
IEEE International Conference on Robotics and Automation (ICRA), Atlanta, USA, 19-23 May 2025
Note

QC 20250618

Available from: 2025-03-07 Created: 2025-03-07 Last updated: 2025-06-18. Bibliographically approved
Marta, D. (2025). Towards safe, aligned, and efficient reinforcement learning from human feedback. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Reinforcement learning policies are becoming increasingly prevalent in robotics and AI-human interactions due to their effectiveness in tackling complex and challenging domains. Many of these policies—also referred to as AI agents—are trained using human feedback through techniques collectively known as Reinforcement Learning from Human Feedback (RLHF). This thesis addresses three key challenges—safety, alignment, and efficiency—that arise when deploying these policies in real-world applications involving actual human users. To this end, it proposes several novel methods. Ensuring the safety of human-robot interaction is a fundamental requirement for their deployment. While most prior research has explored safety within discrete state and action spaces, we investigate novel approaches for synthesizing safety shields from human feedback, enabling safer policy execution in various challenging settings, including continuous state and action spaces, such as social navigation. To better align policies with human feedback, contemporary works predominantly rely on single-reward settings. However, we argue for the necessity of a multi-objective paradigm, as most human goals cannot be captured by a single valued reward function. Moreover, most robotic tasks have baseline predefined goals related to task success, such as reaching a navigation waypoint. Accordingly, we first introduce a method to align policies with multiple objectives using pairwise preferences. Additionally, we propose a novel multi-modal approach that leverages zero-shot reasoning with large language models alongside pairwise preferences to adapt multi-objective goals for these policies. The final challenge addressed in this thesis is improving the sample efficiency and reusability of these policies, which is crucial when adapting policies based on real human feedback. Since requesting human feedback is both costly and burdensome—potentially degrading the quality of human-agent interactions—we propose two distinct methods to mitigate these issues. First, to enhance the efficiency of RLHF, we introduce an active learning method that combines unsupervised learning techniques with uncertainty estimation to prioritize the most informative queries for human feedback. Second, to improve the reusability of reward functions derived from human feedback and reduce the need for redundant queries in similar tasks, we investigate low-rank adaptation techniques for adapting pre-trained reward functions to new tasks.

Abstract [sv]

Reinforcement learning policies are becoming increasingly common in robotics and human-AI interaction thanks to their effectiveness in handling complex and challenging domains. Many of these policies, also called AI agents, are trained using human feedback through techniques collectively known as Reinforcement Learning from Human Feedback (RLHF). This thesis addresses three central challenges, safety, alignment, and efficiency, that arise when deploying these policies in real-world applications involving actual human users. To this end, several new methods are proposed. Ensuring safety in human-robot interaction is a fundamental prerequisite for deployment. Whereas earlier research has mainly studied safety in discrete state and action spaces, we investigate new methods for synthesizing safety shields from human feedback, enabling safer policy execution in a range of challenging settings, including continuous state and action spaces such as social navigation. To better align policies with human feedback, contemporary work relies mainly on single-reward settings. We argue, however, for the need for a multi-objective paradigm, since most human goals cannot be captured by a single-valued reward function. Moreover, most robotic tasks have predefined baseline goals tied to task success, such as reaching a navigation waypoint. Accordingly, we first introduce a method for aligning policies with multiple objectives through pairwise preferences. We also propose a new multi-modal method that leverages zero-shot reasoning with large language models together with pairwise preferences to adapt multi-objective goals for these policies. The final challenge addressed in this thesis is improving the sample efficiency and reusability of these policies, which is crucial when adapting policies based on real human feedback. Since collecting human feedback is both costly and burdensome, and can degrade the quality of human-agent interactions, we propose two different methods to mitigate these problems. First, we introduce an active learning method that improves the efficiency of RLHF by combining unsupervised learning techniques with uncertainty estimation to prioritize the most informative queries for human feedback. Second, we investigate low-rank adaptation techniques for adapting pre-trained reward functions to new tasks, which improves the reusability of reward functions learned from human feedback and reduces the need for redundant queries in similar tasks.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025. p. xi, 77
Series
TRITA-EECS-AVL ; 2025:49
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363515 (URN) 978-91-8106-275-5 (ISBN)
Public defence
2025-06-05, Q2, Malvinas väg 10, Stockholm, 15:00 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20250519

Available from: 2025-05-20 Created: 2025-05-19 Last updated: 2025-05-20. Bibliographically approved
Holk, S., Marta, D. & Leite, I. (2024). Polite: Preferences combined with highlights in reinforcement learning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). Paper presented at 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, May 13-17, 2024 (pp. 2288-2295). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) In: 2024 IEEE International Conference on Robotics and Automation (ICRA), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 2288-2295. Conference paper, Published paper (Refereed)
Abstract [en]

Many solutions to the challenge of robot learning have been devised, notably by exploring novel ways for humans to communicate complex goals and tasks in reinforcement learning (RL) setups. One approach that has received recent research interest addresses the problem directly by treating human feedback as preferences between pairs of trajectories (sequences of state-action pairs). However, when a single preference is attributed to a pair of trajectories that contain many agglomerated steps, key pieces of information are lost in the process. We extend the initial definition of preferences to account for highlights: state-action pairs of relatively high information (high/low reward) within a preferred trajectory. To include this additional information, we design novel regularization methods within a preference learning framework. To this end, we present a method that greatly reduces the number of preferences needed by allowing users to highlight parts of favoured trajectories, thereby reducing the entropy of the credit assignment. We show the effectiveness of our work both in simulation and in a user study, which analyzes the feedback given and its implications. We also use the collected feedback to train a robot policy for socially compliant trajectories in a simulated social navigation environment. We release code and video examples at https://sites.google.com/view/rl-polite
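As a rough illustration of the kind of objective involved, the sketch below combines the standard Bradley-Terry preference loss over segment returns with one possible regularization term for highlighted steps; the exact regularizers used in the paper may differ, and all shapes, weights, and names here are assumptions.

    import torch
    import torch.nn.functional as F

    def preference_loss(reward_net, seg_a, seg_b, pref, highlights_a=None, lam=0.1):
        # seg_a, seg_b: (T, feat) tensors of state-action features; pref = 1.0 if A is preferred
        ret_a = reward_net(seg_a).sum()
        ret_b = reward_net(seg_b).sum()
        logits = torch.stack([ret_a, ret_b]).unsqueeze(0)
        target = torch.tensor([0 if pref == 1.0 else 1])
        bt_loss = F.cross_entropy(logits, target)        # Bradley-Terry preference term
        reg = torch.tensor(0.0)
        if highlights_a is not None:
            # highlights_a: indices of highlighted steps in the preferred segment;
            # push their per-step reward above the segment's mean reward
            per_step = reward_net(seg_a).squeeze(-1)
            reg = F.relu(per_step.mean() - per_step[highlights_a]).mean()
        return bt_loss + lam * reg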

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-360979 (URN) 10.1109/ICRA57147.2024.10610505 (DOI) 001294576201130 () 2-s2.0-85198995464 (Scopus ID)
Conference
2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, May 13-17, 2024
Note

Part of ISBN 979-8-3503-8457-4

QC 20250307

Available from: 2025-03-07 Created: 2025-03-07 Last updated: 2025-05-05. Bibliographically approved
Holk, S., Marta, D. & Leite, I. (2024). PREDILECT: Preferences Delineated with Zero-Shot Language-based Reasoning in Reinforcement Learning. In: HRI 2024 - Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction. Paper presented at 19th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2024, Boulder, United States of America, Mar 11 2024 - Mar 15 2024 (pp. 259-268). Association for Computing Machinery (ACM)
2024 (English) In: HRI 2024 - Proceedings of the 2024 ACM/IEEE International Conference on Human-Robot Interaction, Association for Computing Machinery (ACM), 2024, p. 259-268. Conference paper, Published paper (Refereed)
Abstract [en]

Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences on different sequences of state-action pairs. However, formulating realistic policies for robots demands responses from humans to an extensive array of queries. In this work, we approach the sample-efficiency challenge by expanding the information collected per query to contain both preferences and optional text prompting. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to contain flexible highlights: state-action pairs that carry relatively high information and are related to the features processed in a zero-shot fashion by a pretrained LLM. In both a simulated scenario and a user study, we reveal the effectiveness of our work by analyzing the feedback and its implications. Additionally, the collected feedback serves to train a robot on socially compliant trajectories in a simulated social navigation landscape. We provide video examples of the trained policies at https://sites.google.com/view/rl-predilect.
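A hypothetical sketch of the text-to-highlights step: a zero-shot LLM call maps the user's free-text comment to the task features it mentions, which can then be treated as highlights in the reward objective. The query_llm helper, the prompt, and the expected output format are assumptions, not the paper's implementation.

    import json

    def extract_highlights(user_text, feature_names, query_llm):
        # query_llm: any callable that sends a prompt to an LLM and returns its text response
        prompt = (
            "A user compared two robot trajectories and wrote:\n"
            f'"{user_text}"\n'
            f"From this list of features {feature_names}, return a JSON list of the "
            "features the user refers to, and nothing else."
        )
        raw = query_llm(prompt)              # e.g. '["distance_to_person", "speed"]'
        try:
            return [f for f in json.loads(raw) if f in feature_names]
        except json.JSONDecodeError:
            return []                        # fall back to a plain preference if parsing fails

    # usage (hypothetical): extract_highlights("it got too close to the pedestrian",
    #                                          ["distance_to_person", "speed", "jerk"], query_llm)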

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Series
ACM/IEEE International Conference on Human-Robot Interaction, ISSN 2167-2148
Keywords
Human-in-the-loop Learning, Interactive learning, Preference learning, Reinforcement learning
National Category
Robotics and automation
Identifiers
urn:nbn:se:kth:diva-344936 (URN) 10.1145/3610977.3634970 (DOI) 2-s2.0-85188450390 (Scopus ID)
Conference
19th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2024, Boulder, United States of America, Mar 11 2024 - Mar 15 2024
Note

QC 20240404

Part of ISBN 979-840070322-5

Available from: 2024-04-03 Created: 2024-04-03 Last updated: 2025-03-07. Bibliographically approved
Marta, D., Holk, S., Pek, C. & Leite, I. (2024). SEQUEL: Semi-Supervised Preference-based RL with Query Synthesis via Latent Interpolation. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). Paper presented at IEEE International Conference on Robotics and Automation (ICRA) (pp. 9585-9592). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) In: 2024 IEEE International Conference on Robotics and Automation (ICRA), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 9585-9592. Conference paper, Published paper (Refereed)
Abstract [en]

Preference-based reinforcement learning (RL) has emerged as a recent research direction in robot learning, allowing humans to teach robots through preferences over pairs of desired behaviours. Nonetheless, obtaining realistic robot policies requires humans to answer an arbitrarily large number of queries. In this work, we approach the sample-efficiency challenge by presenting a technique that synthesizes queries from a semi-supervised learning perspective. To achieve this, we leverage latent variational autoencoder (VAE) representations of trajectory segments (sequences of state-action pairs). Our approach produces queries that are closely aligned with those labeled by humans, while avoiding excessive uncertainty in the human preference predictions given by the reward estimates. Additionally, by introducing variation without deviating from the original human intent, more robust reward function representations are achieved. We compare our approach to recent state-of-the-art semi-supervised learning techniques for preference-based RL. Our experimental findings reveal that we can enhance the generalization of the estimated reward function without requiring additional human intervention. Lastly, to confirm the practical applicability of our approach, we conduct experiments involving actual human users in a simulated social navigation setting. Videos of the experiments can be found at https://sites.google.com/view/rl-sequel
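The query-synthesis idea can be sketched as follows, assuming a VAE whose encode/decode methods operate on fixed-length trajectory segments: interpolating between the latents of two human-labeled segments yields synthetic segments that stay close to the labeled data, which can then be pseudo-labeled by the current reward model. The interface and the interpolation weights are illustrative assumptions.

    import torch

    def synthesize_queries(vae, seg_a, seg_b, alphas=(0.25, 0.5, 0.75)):
        # seg_a, seg_b: (T, feat) human-labeled segments; returns decoded synthetic segments
        with torch.no_grad():
            z_a, _ = vae.encode(seg_a)       # encode() assumed to return (mean, logvar)
            z_b, _ = vae.encode(seg_b)
            return [vae.decode((1 - a) * z_a + a * z_b) for a in alphas]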

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-360978 (URN) 10.1109/ICRA57147.2024.10610534 (DOI) 001369728000064 () 2-s2.0-85199009127 (Scopus ID)
Conference
IEEE International Conference on Robotics and Automation (ICRA)
Note

Part of ISBN 9798350384574

QC 20250310

Available from: 2025-03-07 Created: 2025-03-07 Last updated: 2025-03-10. Bibliographically approved
Gillet, S., Marta, D., Akif, M. & Leite, I. (2024). Shielding for Socially Appropriate Robot Listening Behaviors. In: 2024 33RD IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, ROMAN 2024. Paper presented at 33rd IEEE International Conference on Robot and Human Interactive Communication (IEEE RO-MAN) - Embracing Human-Centered HRI, AUG 26-30, 2024, Pasadena, CA (pp. 2279-2286). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) In: 2024 33RD IEEE INTERNATIONAL CONFERENCE ON ROBOT AND HUMAN INTERACTIVE COMMUNICATION, ROMAN 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 2279-2286. Conference paper, Published paper (Refereed)
Abstract [en]

A crucial part of traditional reinforcement learning (RL) is the initial exploration phase, in which trying available actions randomly is a critical element. As random behavior might be detrimental to a social interaction, this work proposes a novel paradigm for learning social robot behavior: the use of shielding to ensure socially appropriate behavior during exploration and learning. We explore how a data-driven approach for shielding could be used to generate listening behavior. In a video-based user study (N=110), we compare shielded exploration to two other exploration methods. We show that the shielded exploration is perceived as more comforting and appropriate than a straightforward random approach. Based on our findings, we discuss the potential for future work using shielded and socially guided approaches for learning idiosyncratic social robot behaviors through RL.
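A minimal sketch of the shielding idea, assuming a data-driven appropriateness classifier and a safe fallback policy (both placeholders rather than the paper's exact components): before an exploratory action is executed, the shield predicts whether it is socially appropriate in the current state and substitutes a fallback action if not.

    import numpy as np

    def shielded_action(state, proposed_action, appropriateness_clf, fallback_policy,
                        threshold=0.5):
        x = np.concatenate([state, proposed_action])[None, :]
        p_ok = appropriateness_clf.predict_proba(x)[0, 1]   # e.g. an sklearn-style classifier
        if p_ok >= threshold:
            return proposed_action                           # shield lets the action through
        return fallback_policy(state)                        # otherwise use a safe substitute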

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
IEEE RO-MAN, ISSN 1944-9445
National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-358780 (URN) 10.1109/RO-MAN60168.2024.10731356 (DOI) 001348918600302 () 2-s2.0-85209799050 (Scopus ID)
Conference
33rd IEEE International Conference on Robot and Human Interactive Communication (IEEE RO-MAN) - Embracing Human-Centered HRI, AUG 26-30, 2024, Pasadena, CA
Note

Part of ISBN 979-8-3503-7502-2

QC 20250122

Available from: 2025-01-22 Created: 2025-01-22 Last updated: 2025-01-22. Bibliographically approved
Gillet, S., Marta, D., Akif, M. & Leite, I. (2024). Shielding for socially appropriate robot listening behaviors. In: 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). Paper presented at 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Pasadena, California, USA, August 26th-30th, 2024.
2024 (English) In: 2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

A crucial part of traditional reinforcement learning (RL) is the initial exploration phase, in which trying available actions randomly is a critical element. As random behavior might be detrimental to a social interaction, this work proposes a novel paradigm for learning social robot behavior--the use of shielding to ensure socially appropriate behavior during exploration and learning. We explore how a data-driven approach for shielding could be used to generate listening behavior. In a video-based user study (N=110), we compare shielded exploration to two other exploration methods. We show that the shielded exploration is perceived as more comforting and appropriate than a straightforward random approach. Based on our findings, we discuss the potential for future work using shielded and socially guided approaches for learning idiosyncratic social robot behaviors through RL.   

National Category
Computer graphics and computer vision
Identifiers
urn:nbn:se:kth:diva-350432 (URN)
Conference
2024 33rd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Pasadena, California, USA August 26th-30th, 2024
Note

Paper will be published later this year (accepted camera-ready version available).

QC 20240717

Available from: 2024-07-11 Created: 2024-07-11 Last updated: 2025-02-07. Bibliographically approved
Marta, D., Holk, S., Pek, C., Tumova, J. & Leite, I. (2023). Aligning Human Preferences with Baseline Objectives in Reinforcement Learning. In: 2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023). Paper presented at IEEE International Conference on Robotics and Automation (ICRA), MAY 29-JUN 02, 2023, London, ENGLAND. Institute of Electrical and Electronics Engineers (IEEE)
2023 (English) In: 2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Practical implementations of deep reinforcement learning (deep RL) have been challenging due to a multitude of factors, such as designing reward functions that cover every possible interaction. To address the heavy burden of robot reward engineering, we aim to leverage subjective human preferences gathered in the context of human-robot interaction, while taking advantage of a baseline reward function when available. By considering baseline objectives that are designed beforehand, we are able to narrow down the policy space, requesting human attention only when their input matters the most. To allow for control over the optimization of different objectives, our approach adopts a multi-objective setting. We achieve human-compliant policies by sequentially training an optimal policy from a baseline specification and collecting queries on pairs of trajectories. These policies are obtained by training a reward estimator to generate Pareto-optimal policies that include human-preferred behaviours. Our approach ensures sample efficiency, and we conducted a user study to collect real human preferences, which we used to obtain a policy in a social navigation environment.
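As an illustration of how a predefined baseline objective can be combined with a preference-learned reward in a multi-objective setting, the sketch below uses a simple linear scalarization; this is one common choice rather than necessarily the paper's exact mechanism, and all names are assumptions.

    def combined_reward(state, action, baseline_reward_fn, pref_reward_net, w=0.5):
        r_task = baseline_reward_fn(state, action)       # predefined objective, e.g. reach the waypoint
        r_pref = float(pref_reward_net(state, action))   # reward learned from pairwise preferences
        return (1.0 - w) * r_task + w * r_pref           # sweeping w traces different trade-offs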

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Series
IEEE International Conference on Robotics and Automation ICRA
National Category
Robotics and automation
Identifiers
urn:nbn:se:kth:diva-324924 (URN) 10.1109/ICRA48891.2023.10161261 (DOI) 001048371100079 () 2-s2.0-85164820716 (Scopus ID)
Conference
IEEE International Conference on Robotics and Automation (ICRA), MAY 29-JUN 02, 2023, London, ENGLAND
Note

Part of ISBN 979-8-3503-2365-8

QC 20230328

Available from: 2023-03-21 Created: 2023-03-21 Last updated: 2025-05-19. Bibliographically approved
Marta, D., Holk, S., Pek, C., Tumova, J. & Leite, I. (2023). VARIQuery: VAE Segment-based Active Learning for Query Selection in Preference-based Reinforcement Learning. Paper presented at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023), October 1-5, 2023, Detroit, Michigan, USA.
2023 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Human-in-the-loop reinforcement learning (RL) methods actively integrate human knowledge to create reward functions for various robotic tasks. Learning from preferences shows promise, as it alleviates the requirement for demonstrations by querying humans about state-action sequences. However, the limited granularity of sequence-based approaches complicates temporal credit assignment. The amount of human querying is contingent on query quality, as redundant queries result in excessive human involvement. This paper addresses the often-overlooked aspect of query selection, which is closely related to active learning (AL). We propose a novel query selection approach that leverages variational autoencoder (VAE) representations of state sequences. In this manner, we formulate queries that are diverse in nature while simultaneously taking into account reward model estimations. We compare our approach to the current state-of-the-art query selection methods in preference-based RL, and find ours to be on par with or more sample efficient than them through extensive benchmarking on simulated environments relevant to robotics. Lastly, we conduct an online study to verify the effectiveness of our query selection approach with real human feedback and examine several metrics related to human effort.
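The query-selection idea can be sketched as picking candidate segments that are both diverse in a VAE latent space and uncertain under an ensemble of reward models; the greedy scoring below, the encode interface, and the ensemble are illustrative assumptions.

    import torch

    def select_queries(candidates, vae, reward_ensemble, k=10, beta=1.0):
        # candidates: list of (T, feat) trajectory segments
        with torch.no_grad():
            z = torch.stack([vae.encode(c)[0].flatten() for c in candidates])
            returns = torch.stack([torch.stack([m(c).sum() for m in reward_ensemble])
                                   for c in candidates])      # (N, num_models)
        uncertainty = returns.std(dim=1)                       # ensemble disagreement per segment
        chosen = [int(uncertainty.argmax())]                   # start with the most uncertain segment
        while len(chosen) < k:
            dist = torch.cdist(z, z[chosen]).min(dim=1).values   # distance to the chosen set
            score = dist + beta * uncertainty                    # favour diverse AND uncertain
            score[chosen] = -float("inf")
            chosen.append(int(score.argmax()))
        return chosen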

National Category
Robotics and automation
Identifiers
urn:nbn:se:kth:diva-333948 (URN)
Conference
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2023), October 1-5, 2023, Detroit, Michigan, USA.
Note

QC 20230818

Available from: 2023-08-15 Created: 2023-08-15 Last updated: 2025-02-09. Bibliographically approved
Marta, D., Holk, S., Pek, C., Tumova, J. & Leite, I. (2023). VARIQuery: VAE Segment-Based Active Learning for Query Selection in Preference-Based Reinforcement Learning. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023. Paper presented at 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023, Detroit, United States of America, Oct 1 2023 - Oct 5 2023 (pp. 7878-7885). Institute of Electrical and Electronics Engineers (IEEE)
2023 (English) In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023, Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 7878-7885. Conference paper, Published paper (Refereed)
Abstract [en]

Human-in-the-loop reinforcement learning (RL) methods actively integrate human knowledge to create reward functions for various robotic tasks. Learning from preferences shows promise, as it alleviates the requirement for demonstrations by querying humans about state-action sequences. However, the limited granularity of sequence-based approaches complicates temporal credit assignment. The amount of human querying is contingent on query quality, as redundant queries result in excessive human involvement. This paper addresses the often-overlooked aspect of query selection, which is closely related to active learning (AL). We propose a novel query selection approach that leverages variational autoencoder (VAE) representations of state sequences. In this manner, we formulate queries that are diverse in nature while simultaneously taking into account reward model estimations. We compare our approach to the current state-of-the-art query selection methods in preference-based RL, and find ours to be on par with or more sample efficient than them through extensive benchmarking on simulated environments relevant to robotics. Lastly, we conduct an online study to verify the effectiveness of our query selection approach with real human feedback and examine several metrics related to human effort.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Computer Sciences; Robotics and automation
Identifiers
urn:nbn:se:kth:diva-342645 (URN) 10.1109/IROS55552.2023.10341795 (DOI) 001136907802029 () 2-s2.0-85182523595 (Scopus ID)
Conference
2023 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2023, Detroit, United States of America, Oct 1 2023 - Oct 5 2023
Note

Part of ISBN 978-1-6654-9190-7

QC 20240126

Available from: 2024-01-25 Created: 2024-01-25 Last updated: 2025-05-19. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-3510-5481
