Emotional Speech-Driven Animation with Content-Emotion Disentanglement
Max Planck Institute for Intelligent Systems, Germany.
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Computational Science and Technology (CST). ORCID iD: 0000-0002-7414-845X
Max Planck Institute for Intelligent Systems, Germany.
Max Planck Institute for Intelligent Systems, Germany.
2023 (English). In: Proceedings - SIGGRAPH Asia 2023 Conference Papers, SA 2023, Association for Computing Machinery (ACM), 2023, article id 41. Conference paper, Published paper (Refereed)
Abstract [en]

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023. Article id 41
Keywords [en]
Computer Graphics, Computer Vision, Deep learning, Facial Animation, Speech-driven Animation
National Category
Computer graphics and computer vision
Identifiers
URN: urn:nbn:se:kth:diva-347500
DOI: 10.1145/3610548.3618183
ISI: 001278296700041
Scopus ID: 2-s2.0-85180390692
OAI: oai:DiVA.org:kth-347500
DiVA, id: diva2:1873752
Conference
2023 SIGGRAPH Asia 2023 Conference Papers, SA 2023, Sydney, Australia, Dec 12 2023 - Dec 15 2023
Note

Part of ISBN 9798400703157

QC 20240619

Available from: 2024-06-19. Created: 2024-06-19. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

No full text in DiVA


Person

Chhatre, Kiran
