Compared to text-driven, diffusion-model-based image generation, editing is challenging because it requires localized modifications. The complexity is further amplified for videos by the additional requirement of temporal consistency. These challenges are most pronounced when editing real-world videos, which, unlike synthetic videos, often feature moving camera views and occlusions. In this paper, we propose RealCraft, an attention-control-based method for zero-shot real-world video editing. By swapping cross-attention to inject new features during editing and relaxing the spatial-temporal attention of the edited object, we achieve localized, shape-wise edits with enhanced temporal consistency. Our method uses Stable Diffusion directly and operates without additional networks or parameters. We showcase the proposed zero-shot attention-control-based method across a range of videos, demonstrating shape-wise, time-consistent, and parameter-free editing in videos of up to 64 frames.
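To illustrate the core idea of cross-attention swapping in the prompt-to-prompt style the abstract alludes to, the following is a minimal, hypothetical NumPy sketch (all names, shapes, and the single-head setup are our assumptions for illustration, not the paper's implementation): the attention map computed for the source prompt is reused in the edited pass, so the new prompt's token values are injected while the original spatial layout is preserved.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v, attn_override=None):
    """Single-head cross-attention.

    q: (n_pixels, d) spatial queries; k, v: (n_tokens, d) prompt keys/values.
    If attn_override is given, it replaces the computed attention map,
    so values v are aggregated with the overridden (e.g. source) layout.
    """
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    if attn_override is not None:
        attn = attn_override  # swap in the source attention map
    return attn @ v, attn

rng = np.random.default_rng(0)
d, n_pix, n_tok = 8, 4, 5
q = rng.standard_normal((n_pix, d))          # shared spatial queries
k_src = rng.standard_normal((n_tok, d))      # source-prompt keys
v_src = rng.standard_normal((n_tok, d))      # source-prompt values
k_tgt = rng.standard_normal((n_tok, d))      # edit-prompt keys
v_tgt = rng.standard_normal((n_tok, d))      # edit-prompt values

# Source pass: record the cross-attention map.
_, attn_src = cross_attention(q, k_src, v_src)

# Edited pass: aggregate the edit prompt's values under the
# source attention map, injecting new features at the old layout.
out_swapped, _ = cross_attention(q, k_tgt, v_tgt, attn_override=attn_src)
```

This sketch only shows the swap at a single attention layer; in a diffusion model the same substitution would be applied inside the denoising U-Net's cross-attention layers across timesteps.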
Part of ISBN 9789819543779