Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows
Alexanderson, Simon: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-7801-7617
Henter, Gustav Eje: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
Kucherenko, Taras: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-9838-8848
Beskow, Jonas: KTH, Superseded Departments (pre-2005), Speech, Music and Hearing. ORCID iD: 0000-0003-1399-6604
2020 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal, giving a rich and natural variation of motion, just as in humans. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and match the input speech well. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
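
The record carries no code, and the abstract describes the method only at a high level. As a rough, illustrative sketch of the normalising-flow idea behind a MoGlow-style model, the PyTorch snippet below implements a single speech-conditioned affine coupling layer and shows how fresh latent noise yields different, plausible poses for the same speech input. Every name, dimension and architectural choice here is an assumption made for illustration; the actual system stacks many flow steps and uses recurrent conditioning, and this is not the authors' implementation.

```python
# Illustrative sketch only: one speech-conditioned affine coupling layer,
# loosely in the spirit of MoGlow. Not the authors' code.
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Transforms half of the pose vector with a scale and shift predicted
    from the other half concatenated with speech conditioning features."""
    def __init__(self, pose_dim: int, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, x, cond):
        # Normalising direction (pose -> latent); also returns log|det J|.
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x_a, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)              # keep scales numerically tame
        z_b = (x_b - t) * torch.exp(-log_s)
        return torch.cat([x_a, z_b], dim=-1), -log_s.sum(dim=-1)

    def inverse(self, z, cond):
        # Sampling direction (latent -> pose); exact inverse of forward().
        z_a, z_b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z_a, cond], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x_b = z_b * torch.exp(log_s) + t
        return torch.cat([z_a, x_b], dim=-1)

pose_dim, cond_dim = 45, 27                    # hypothetical sizes
flow = ConditionalAffineCoupling(pose_dim, cond_dim)

# Training signal (maximum likelihood): push real poses through the flow and
# score the latents under a standard-normal base distribution.
real_pose = torch.randn(8, pose_dim)           # stand-in for mocap frames
cond = torch.randn(8, cond_dim)                # stand-in for speech features
z, logdet = flow(real_pose, cond)
nll = 0.5 * (z ** 2).sum(dim=-1) - logdet      # up to an additive constant
loss = nll.mean()

# Sampling: the same speech with fresh noise gives different plausible poses.
speech = torch.randn(1, cond_dim)
samples = [flow.inverse(torch.randn(1, pose_dim), speech) for _ in range(3)]
```

Under this reading, the directorial control described in the abstract would amount to appending style parameters (for example, desired gesture height or speed) to the conditioning vector `cond`, so the same sampling procedure can be steered at synthesis time.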

Place, publisher, year, edition, pages
2020.
National Category
Computer and Information Sciences
Identifiers
URN: urn:nbn:se:kth:diva-268363
OAI: oai:DiVA.org:kth-268363
DiVA, id: diva2:1394216
Conference
EUROGRAPHICS 2020
Note

QCR 20200513

Available from: 2020-02-18. Created: 2020-02-18. Last updated: 2020-05-13. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Conference webpage

Authority records

Alexanderson, Simon; Henter, Gustav Eje; Kucherenko, Taras; Beskow, Jonas
