In-game voice2face model with knowledge distillation: Knowledge distillation for 3D speech-driven facial animation with lip-sync
KTH, School of Electrical Engineering and Computer Science (EECS).
2024 (English). Independent thesis, Advanced level (degree of Master (Two Years)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Voice2Face is a speech-driven 3D facial animation model that generates corresponding lip-sync animations from speech audio. However, the model's high computational and memory costs pose challenges for real-time use in games. This thesis investigates the potential of compressing the Voice2Face model using knowledge distillation techniques. We analyzed the architecture of Voice2Face and designed three smaller candidate models with different structures, together with a training framework that does not require ground truth. We then trained the candidates on the public LibriSpeech speech dataset and evaluated their performance against the original model. Both qualitative and quantitative results indicate that all three candidates perform competitively, and that one model is compact enough for real-time in-game use. Additionally, we propose a method to further reduce model latency.
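
The thesis code is not reproduced in this record, but the training framework the abstract describes, where a small student model learns to reproduce the frozen Voice2Face teacher's output on unlabeled speech so that no ground-truth animation data is needed, can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration: the StudentV2F architecture, the mel-spectrogram input, the 52-dimensional animation output, and the plain MSE distillation loss are hypothetical and are not taken from the thesis.

```python
import torch
import torch.nn as nn


class StudentV2F(nn.Module):
    """Compact audio-to-animation regressor (hypothetical architecture)."""

    def __init__(self, n_mels=80, n_outputs=52, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_outputs),
        )

    def forward(self, mel):              # mel: (batch, frames, n_mels)
        return self.net(mel)             # -> (batch, frames, n_outputs)


def distill_step(teacher, student, optimizer, mel_batch):
    """One distillation step: the student mimics the frozen teacher's
    animation output on unlabeled speech, so no ground truth is required."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(mel_batch)      # teacher's lip-sync output
    pred = student(mel_batch)
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Example wiring (teacher stands in for the pretrained Voice2Face model):
# student = StudentV2F()
# optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
# loss = distill_step(teacher, student, optimizer, mel_batch)
```

Distilling against the teacher's outputs rather than labeled animation data is what lets a corpus such as LibriSpeech, which has no 3D facial annotations, serve as the training data.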

Abstract [sv]

Voice2Face is a speech-driven 3D facial animation model that generates corresponding lip-sync animations from speech audio. However, the high computational and memory costs of the Voice2Face model pose challenges for real-time applications in games. This thesis investigates the potential of compressing the Voice2Face model using knowledge distillation techniques. We analyzed the architecture of Voice2Face and designed three smaller candidate models with different structures, together with a training framework that does not require ground truth. We then trained them on the public speech dataset LibriSpeech and evaluated their performance against the original model. Both qualitative and quantitative results indicate that all three candidates show competitive performance, with one model compact enough for real-time in-game use. Additionally, we propose a method to further reduce model latency.

Place, publisher, year, edition, pages
2024, p. 42
Series
TRITA-EECS-EX ; 2024:761
Keywords [en]
real-time 3D lip-sync animation, cVAE, knowledge distillation
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-360152
OAI: oai:DiVA.org:kth-360152
DiVA id: diva2:1938578
External cooperation
Electronic Arts AB
Available from: 2025-02-24. Created: 2025-02-18. Last updated: 2025-02-24. Bibliographically approved.

Open Access in DiVA

fulltext (1138 kB), 47 downloads
File information
File name: FULLTEXT01.pdf
File size: 1138 kB
Checksum (SHA-512): 58ca4ffffaa13a5939fd7296f55716b63ac63a1e42c72afa2054c10fd668472c93038a576810f3a29581ea3dcd0bc8a904a93af83632a13ae636b01b7165749a
Type: fulltext
Mimetype: application/pdf
