PACA: Perspective-Aware Cross-Attention Representation for Zero-shot Scene Rearrangement
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-0611-4239
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Robotics, Perception and Learning, RPL. ORCID iD: 0009-0008-7672-970X
Graz University of Technology, Graz, Austria.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-1114-6040
2025 (English). In: Proceedings IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025, Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 6559-6569. Conference paper, Published paper (Refereed)
Abstract [en]

Scene rearrangement, like table tidying, is a challenging task in robotic manipulation due to the complexity of predicting diverse object arrangements. Web-scale trained generative models such as Stable Diffusion [52] can aid by generating natural scenes as goals. To facilitate robot execution, object-level representations must be extracted to match the real scenes with the generated goals and to calculate object pose transformations. Current methods typically use a multi-step design that involves separate models for generation, segmentation, and feature encoding, which can lead to a low success rate due to error accumulation. Furthermore, they lack control over the viewing perspectives of the generated goals, restricting the tasks to 3-DoF settings. In this paper, we propose PACA, a zero-shot pipeline for scene rearrangement that leverages perspective-aware cross-attention representation derived from Stable Diffusion. Specifically, we develop an object-level representation that integrates generation, segmentation, and feature encoding into a single step. Additionally, we introduce perspective control, thus enabling the matching of 6-DoF camera views and extending past approaches that were limited to 3-DoF top-down settings. The efficacy of our method is demonstrated through its zero-shot performance in real robot experiments across various scenes, achieving an average matching accuracy and execution success rate of 87% and 67%, respectively.
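As context for the cross-attention representation the abstract describes, the sketch below illustrates the general mechanism in a self-contained way: in a text-conditioned diffusion model, each spatial location of the image latent attends over the prompt's text tokens, so the per-token attention weights form coarse spatial maps that can serve as object-level masks. All dimensions and the random features here are illustrative assumptions, not the paper's implementation or the Stable Diffusion API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes: a 16x16 latent grid, 4 text tokens, 64-dim features.
H = W = 16
n_tokens, d = 4, 64
rng = np.random.default_rng(0)

Q = rng.standard_normal((H * W, d))     # queries from image patches (stand-in for U-Net features)
K = rng.standard_normal((n_tokens, d))  # keys from text tokens (stand-in for prompt embeddings)

# Cross-attention: each spatial location distributes weight over text tokens.
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # shape (H*W, n_tokens); rows sum to 1

# Reshaping per-token columns back to the grid gives one spatial map per token,
# which is the kind of object-level signal a segmentation/feature step can reuse.
maps = attn.T.reshape(n_tokens, H, W)
print(maps.shape)  # (4, 16, 16)
```

In an actual pipeline these Q and K would come from a denoising U-Net's cross-attention layers rather than random features; the point is only that generation and object localization can share one forward pass.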

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 6559-6569
Series
IEEE Winter Conference on Applications of Computer Vision, ISSN 2472-6737
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-374028
DOI: 10.1109/WACV61041.2025.00639
ISI: 001521272600150
Scopus ID: 2-s2.0-105003628116
OAI: oai:DiVA.org:kth-374028
DiVA, id: diva2:2023038
Conference
2025 Winter Conference on Applications of Computer Vision-WACV, FEB 28-MAR 04, 2025, Tucson, AZ
Note

Part of ISBN 979-8-3315-1084-8; 979-8-3315-1083-1

QC 20251218

Available from: 2025-12-18. Created: 2025-12-18. Last updated: 2025-12-20. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Jin, Shutong; Wang, Ruiyu; Pokorny, Florian T.

Search in DiVA

By author/editor
Jin, Shutong; Wang, Ruiyu; Pokorny, Florian T.
By organisation
Robotics, Perception and Learning, RPL
Computer graphics and computer vision

Search outside of DiVA

Google
Google Scholar
