S2-Diffusion: Generalizing from Instance-level to Category-level Skills in Robot Manipulation
Yang, Quantao
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-5655-0990
Welle, Michael C.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. INCAR Robotics AB, Sweden. ORCID iD: 0000-0003-3827-3824
Kragic Jensfelt, Danica
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-2965-2953
Andersson, Olov
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-7248-1112
2025 (English). In: IEEE Robotics and Automation Letters, E-ISSN 2377-3766, Vol. 10, no. 12, p. 12995-13002. Article in journal (Refereed). Published.
Abstract [en]

Recent advances in skill learning have propelled robot manipulation to new heights by enabling robots to learn complex manipulation tasks from a practical number of demonstrations. However, these skills are often limited to the particular action, object, and environment instances shown in the training data, and transfer poorly to other instances of the same category. In this work we present an open-vocabulary Spatial-Semantic Diffusion policy (S2-Diffusion) that generalizes from instance-level training data to the category level, making skills transferable between instances of the same category. We show that the functional aspects of skills can be captured by combining a promptable semantic module with a spatial representation. We further propose leveraging depth estimation networks so that only a single RGB camera is required. Our approach is evaluated and compared on a diverse set of robot manipulation tasks, both in simulation and in the real world. Our results show that S2-Diffusion is invariant to changes in category-irrelevant factors and achieves satisfactory performance on other instances within the same category, even when not trained on those specific instances. Project website: https://s2-diffusion.github.io.
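As a rough illustration of the pipeline the abstract describes, the sketch below shows a diffusion policy whose denoiser is conditioned on a fused spatial-semantic embedding. It is not the authors' implementation: the names (ConditionalDenoiser, sample_actions), the feature dimensions, the plain MLP denoiser, and the simplified DDPM sampler are all hypothetical stand-ins, and the promptable semantic module and depth-derived spatial representation from the paper are mimicked here by random placeholder tensors.

# Illustrative sketch only -- NOT the authors' released code. It shows the
# general shape of a diffusion policy conditioned on a fused spatial-semantic
# feature; all dimensions and module choices below are assumptions.
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to an action sequence, conditioned on a
    fused observation embedding (spatial + semantic) and a timestep."""
    def __init__(self, action_dim=7, horizon=16, cond_dim=256, hidden=512):
        super().__init__()
        self.action_dim, self.horizon = action_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(action_dim * horizon + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dim * horizon),
        )

    def forward(self, noisy_actions, cond, t):
        # Flatten the action horizon and append conditioning plus timestep.
        x = torch.cat([noisy_actions.flatten(1), cond, t[:, None].float()], dim=1)
        return self.net(x).view(-1, self.horizon, self.action_dim)

@torch.no_grad()
def sample_actions(model, cond, steps=50):
    """Simplified DDPM reverse process with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(cond.shape[0], model.horizon, model.action_dim)
    for t in reversed(range(steps)):
        eps = model(a, cond, torch.full((cond.shape[0],), t))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        a = (a - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final denoising step
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)
    return a

# Fused conditioning: a depth-derived spatial embedding concatenated with a
# prompt-scored semantic embedding (both random stand-ins here).
spatial = torch.randn(1, 128)
semantic = torch.randn(1, 128)
cond = torch.cat([spatial, semantic], dim=1)

model = ConditionalDenoiser()
actions = sample_actions(model, cond)
print(actions.shape)  # torch.Size([1, 16, 7])

The point of the fused conditioning vector is that category-irrelevant appearance changes ideally leave the semantic embedding stable, which is what would let a trained policy transfer across instances of a category.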

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025. Vol. 10, no. 12, p. 12995-13002.
Keywords [en]
Deep Learning in Grasping and Manipulation, Imitation Learning, Learning from Demonstration
National Category
Robotics and automation; Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-372570
DOI: 10.1109/LRA.2025.3625497
ISI: 001611073100009
Scopus ID: 2-s2.0-105019658163
OAI: oai:DiVA.org:kth-372570
DiVA, id: diva2:2012846
Note

QC 20260120

Available from: 2025-11-10. Created: 2025-11-10. Last updated: 2026-01-20. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Yang, Quantao; Welle, Michael C.; Kragic Jensfelt, Danica; Andersson, Olov
