Animations produced by generative models are typically evaluated with objective quantitative metrics that do not fully capture perceptual effects in immersive virtual environments. To address this gap, we present a preliminary perceptual evaluation of generative models for animation synthesis, conducted via a VR-based user study (N = 48). Our investigation focuses on animation congruency: ensuring that generated facial expressions and body gestures are both congruent with and synchronized to the driving speech. We evaluated two state-of-the-art methods, a speech-driven full-body animation model and a video-driven full-body reconstruction model, assessing each model's capability to produce such congruent animations. Our results show a strong user preference for combined facial and body animations, indicating that congruent multimodal animation significantly enhances perceived realism compared to animation of a single modality. By incorporating VR-based perceptual feedback into training pipelines, our approach lays a foundation for developing more engaging and responsive virtual characters.