KTH Publications (DiVA)
Feature Extractor or Decision Maker: Rethinking the Role of Visual Encoders in Visuomotor Policies
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0009-0008-7672-970X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-6632-3342
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-0611-4239
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Collaborative Autonomous Systems.
2025 (English). In: 2025 IEEE International Conference on Robotics and Automation, ICRA 2025, Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 3654-3661. Conference paper, Published paper (Refereed)
Abstract [en]

An end-to-end (E2E) visuomotor policy is typically treated as a unified whole, but recent approaches that pretrain the visual encoder on out-of-domain (OOD) data cleanly separate the visual encoder from the rest of the network, with the remainder referred to as the policy. We propose Visual Alignment Testing, an experimental framework designed to evaluate the validity of this functional separation. Our results indicate that in E2E-trained models, visual encoders actively contribute to decision-making as a result of motor-data supervision, contradicting the assumed functional separation. In contrast, OOD-pretrained models, whose encoders lack this capability, suffer an average performance drop of 42% on our benchmarks relative to the state-of-the-art performance achieved by E2E policies. We believe this initial exploration of the visual encoder's role offers a first step toward pretraining methods that address encoders' decision-making ability, such as task-conditioned or context-aware encoders.
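
To make the functional separation discussed in the abstract concrete, the sketch below contrasts the two training regimes in PyTorch: an E2E setup in which motor-loss gradients reach the visual encoder, and an OOD-pretrained setup in which a frozen encoder feeds a separately supervised policy head. This is a minimal illustrative sketch, not the authors' implementation; all module names, shapes, and hyperparameters are assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of the encoder/policy
# separation contrasted in the abstract.
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Maps an RGB observation to a feature vector."""
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class PolicyHead(nn.Module):
    """Maps visual features to motor actions (e.g., 7-DoF commands)."""
    def __init__(self, feat_dim: int = 128, act_dim: int = 7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, act_dim)
        )

    def forward(self, feat):
        return self.net(feat)

def bc_step(encoder, policy, optimizer, obs, expert_action):
    """One behavior-cloning step supervised by expert motor commands."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(policy(encoder(obs)), expert_action)
    loss.backward()
    optimizer.step()
    return loss.item()

encoder, policy = VisualEncoder(), PolicyHead()
obs = torch.randn(8, 3, 96, 96)       # batch of camera images (illustrative shape)
expert_action = torch.randn(8, 7)     # expert motor commands

# (a) E2E training: one optimizer over both modules, so motor supervision
#     shapes the encoder's features (the "decision-making" the paper probes).
opt_e2e = torch.optim.Adam(
    list(encoder.parameters()) + list(policy.parameters()), lr=1e-4
)
bc_step(encoder, policy, opt_e2e, obs, expert_action)

# (b) OOD-pretrained setup: the encoder is frozen after pretraining on
#     out-of-domain data; only the policy head receives motor supervision.
for p in encoder.parameters():
    p.requires_grad_(False)
opt_ood = torch.optim.Adam(policy.parameters(), lr=1e-4)
bc_step(encoder, policy, opt_ood, obs, expert_action)
```

Under this separation, the paper's Visual Alignment Testing asks whether features learned in regime (a) carry decision-relevant information that the frozen encoder in regime (b) cannot supply.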

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025. p. 3654-3661
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-371368
DOI: 10.1109/ICRA55743.2025.11127332
ISI: 001582497400330
Scopus ID: 2-s2.0-105016697318
OAI: oai:DiVA.org:kth-371368
DiVA, id: diva2:2006400
Conference
2025 IEEE International Conference on Robotics and Automation, ICRA 2025, Atlanta, United States of America, May 19-23, 2025
Note

Part of ISBN 9798331541392

QC 20251014

Available from: 2025-10-14 Created: 2025-10-14 Last updated: 2026-02-04. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Wang, Ruiyu; Zhuang, Zheyu; Jin, Shutong; Ingelhag, Nils; Kragic Jensfelt, Danica; Pokorny, Florian T.
