Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-7414-845X
Max Planck Institute for Intelligent Systems, Germany. ORCID iD: 0000-0002-1651-030X
Max Planck Institute for Intelligent Systems, Germany.
Max Planck Institute for Intelligent Systems, Germany.
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1942-1953. Conference paper, Published paper (Refereed)
Abstract [en]

Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech, with control over the expressed emotions and style, by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de.
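The recombination step described in the abstract (content from the driving speech, emotion and style from another sequence, plus freshly sampled diffusion noise) can be illustrated with a toy sketch. This is not the AMUSE code: the function names, the latent sizes, and the placeholder `encode_audio` are all hypothetical and chosen only for illustration. The abstract states neither dimensions nor network details.

```python
import numpy as np

# Hypothetical latent sizes; the abstract does not give any dimensions.
CONTENT_DIM, EMOTION_DIM, STYLE_DIM = 8, 4, 4


def encode_audio(audio: np.ndarray) -> dict:
    """Toy stand-in for a learned audio encoder (NOT the AMUSE network).

    Returns three separate vectors, mirroring the content / emotion /
    style split described in the abstract.
    """
    stats = np.array([audio.mean(), audio.std(), np.abs(audio).max()])
    return {
        "content": np.resize(np.roll(stats, 0), CONTENT_DIM),
        "emotion": np.resize(np.roll(stats, 1), EMOTION_DIM),
        "style": np.resize(np.roll(stats, 2), STYLE_DIM),
    }


def build_condition(driving: dict, reference: dict) -> np.ndarray:
    """Content from the driving audio, emotion and style from the reference."""
    return np.concatenate(
        [driving["content"], reference["emotion"], reference["style"]]
    )


def sample_initial_noise(seed: int, frames: int, dim: int) -> np.ndarray:
    """Gaussian noise a diffusion sampler would start from (assumed shape)."""
    return np.random.default_rng(seed).standard_normal((frames, dim))


# Two synthetic "audio" signals standing in for real speech clips.
driving_audio = np.sin(np.linspace(0.0, 20.0, 1600))
reference_audio = 0.3 * np.cos(np.linspace(0.0, 5.0, 1600))

condition = build_condition(encode_audio(driving_audio),
                            encode_audio(reference_audio))

# The same condition with different noise seeds would give different
# gesture variations under the same emotion and style.
noise_a = sample_initial_noise(seed=0, frames=30, dim=16)
noise_b = sample_initial_noise(seed=1, frames=30, dim=16)
```

In an actual latent diffusion model, `condition` would be passed to the denoiser at every sampling step and the result decoded to a motion sequence. Those parts are omitted here because the abstract does not describe them.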

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024. p. 1942-1953
National Category
Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-354048
DOI: 10.1109/CVPR52733.2024.00190
ISI: 001322555902029
Scopus ID: 2-s2.0-85202286367
OAI: oai:DiVA.org:kth-354048
DiVA, id: diva2:1901299
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 16-22 2024, Seattle, WA, USA
Note

Part of ISBN 979-8-3503-5300-6

QC 20240930

Available from: 2024-09-26. Created: 2024-09-26. Last updated: 2025-01-20. Bibliographically approved.

Open Access in DiVA

Pdf (936 kB), 95 downloads
File information
File name: FULLTEXT01.pdf
File size: 936 kB
Checksum (SHA-512): 6be60d70db2fedd6d86f18e403bf890887774d1664ff85b3dea8553a9e234e32f91948c90c42a268a887ca7ced4140a9fc1e1a64caf031ffd07e4538a607e1d2
Type: fulltext
Mimetype: application/pdf

Authority records

Chhatre, Kiran; Peters, Christopher

Search in DiVA

By author/editor
Chhatre, Kiran; Daněček, Radek; Peters, Christopher; Bolkart, Timo
By organisation
Computational Science and Technology (CST)
Electrical Engineering, Electronic Engineering, Information Engineering
