Open this publication in new window or tab >>Show others...
2024 (English)In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE) , 2024, p. 1952-1964Conference paper, Published paper (Refereed)
Abstract [en]
Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National Category
Signal Processing Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367174 (URN)10.1109/CVPRW63382.2024.00201 (DOI)001327781702011 ()2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note
Part of ISBN 9798350365474
QC 20250715
2025-07-152025-07-152025-08-13Bibliographically approved