KTH Publications (DiVA)
RealCraft: Attention Control as A Tool for Zero-Shot Consistent Video Editing
Jin, Shutong. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-0611-4239
Wang, Ruiyu. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0009-0008-7672-970X
Pokorny, Florian T. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0003-1114-6040
2026 (English). In: Neural Information Processing - 32nd International Conference, ICONIP 2025, Proceedings. Springer Nature, 2026, p. 137-152. Conference paper, Published paper (Refereed)
Abstract [en]

Compared to text-driven, diffusion-model-based image generation, editing is challenging because it requires localized modifications. The complexity is further amplified for videos by the additional requirement of temporal consistency. These challenges are heightened when editing real-world videos, which, unlike synthetic videos, often feature factors such as moving camera views and occlusions. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention to inject new features during editing and relaxing the spatial-temporal attention of the edited object, we achieve localized shape-wise edits along with enhanced temporal consistency. Our model directly uses Stable Diffusion and operates without the need for additional networks or parameters. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent and parameter-free editing in videos of up to 64 frames.
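The cross-attention swap described in the abstract can be sketched as follows. This is a minimal, illustrative toy in PyTorch, not the authors' implementation: the class name, the `swap` flag, and the caching scheme are hypothetical. The idea it illustrates is that during a pass with the edit prompt, the attention map from the source (unedited) pass is reused so the spatial layout is preserved, while the new prompt's features still flow in through the value projection.

```python
import torch
import torch.nn.functional as F


class SwappableCrossAttention(torch.nn.Module):
    """Toy cross-attention layer that can replace its attention map with one
    cached from a source (unedited) pass. Hypothetical names; an illustration
    of attention control in general, not the RealCraft implementation."""

    def __init__(self, dim: int, ctx_dim: int):
        super().__init__()
        self.to_q = torch.nn.Linear(dim, dim, bias=False)
        self.to_k = torch.nn.Linear(ctx_dim, dim, bias=False)
        self.to_v = torch.nn.Linear(ctx_dim, dim, bias=False)
        self.cached_attn = None  # attention map saved from the source pass
        self.swap = False        # when True, reuse the cached map

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x: latent features (B, N, dim); context: prompt embeddings (B, T, ctx_dim)
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        scale = q.shape[-1] ** -0.5
        attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        if self.swap and self.cached_attn is not None:
            # Keep the source layout, but let the new prompt's values through.
            attn = self.cached_attn
        else:
            # Source pass: cache the attention map for a later edit pass.
            self.cached_attn = attn.detach()
        return attn @ v
```

A typical usage pattern under these assumptions: run a pass with the source prompt (which caches the attention maps), set `swap = True`, then run a pass with the edit prompt; the output mixes the source's spatial attention with the edit prompt's value features.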

Place, publisher, year, edition, pages
Springer Nature, 2026, p. 137-152
Keywords [en]
Text-guided video editing, Zero-shot
National Category
Computer graphics and computer vision; Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:kth:diva-373855
DOI: 10.1007/978-981-95-4378-6_10
Scopus ID: 2-s2.0-105022815736
OAI: oai:DiVA.org:kth-373855
DiVA id: diva2:2021624
Conference
32nd International Conference on Neural Information Processing, ICONIP 2025, Okinawa, Japan, November 20-24, 2025
Note

Part of ISBN 9789819543779

QC 20251215

Available from: 2025-12-15. Created: 2025-12-15. Last updated: 2025-12-15. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text; Scopus

Authority records

Jin, Shutong; Wang, Ruiyu; Pokorny, Florian T.
