Moving Fast and Slow: Analysis of Representations and Post-Processing in Speech-Driven Automatic Gesture Generation
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-9838-8848
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0002-1643-1054
2021 (English). In: International Journal of Human-Computer Interaction, ISSN 1044-7318, E-ISSN 1532-7590, Vol. 37, no. 14, pp. 1300-1316. Article in journal (Refereed). Published
Abstract [en]

This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyze the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
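
The two post-processing steps named in the abstract, hip-centering and smoothing, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that motion is stored as an array of shape (frames, joints, 3), that hip-centering means subtracting the hip joint's position from every joint in each frame, and that smoothing is a simple moving average over time. The function names, the `hip_index` parameter, and the `window` parameter are hypothetical.

```python
import numpy as np


def hip_center(motion, hip_index=0):
    """Express every joint relative to the hip joint in each frame.

    motion: array of shape (frames, joints, 3).
    hip_index: index of the hip joint (hypothetical; depends on the skeleton).
    """
    motion = np.asarray(motion, dtype=float)
    # Slicing with hip_index:hip_index + 1 keeps the joint axis for broadcasting.
    return motion - motion[:, hip_index:hip_index + 1, :]


def smooth(motion, window=5):
    """Moving-average smoothing along the time axis (assumed filter).

    The sequence is edge-padded so the output has as many frames as the input.
    """
    motion = np.asarray(motion, dtype=float)
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    pad = window // 2
    padded = np.pad(motion, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    # Filter each coordinate trajectory independently over time (axis 0).
    return np.apply_along_axis(
        lambda x: np.convolve(x, kernel, mode="valid"), 0, padded
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    generated = rng.normal(size=(100, 15, 3))  # stand-in for model output
    processed = smooth(hip_center(generated), window=5)
    print(processed.shape)
```

A wider `window` removes more jitter but also flattens fast movements, which is the kind of trade-off the abstract's analysis of smoothing is concerned with.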

Place, publisher, year, edition, pages
Informa UK Limited, 2021. Vol. 37, no. 14, pp. 1300-1316
Keywords [en]
Gesture generation, representation learning, neural network, deep learning, virtual agents, non-verbal behavior
National category
Human-Computer Interaction (Interaction Design)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-290787
DOI: 10.1080/10447318.2021.1883883
ISI: 000619086000001
Scopus ID: 2-s2.0-85100955521
OAI: oai:DiVA.org:kth-290787
DiVA, id: diva2:1530447
Funder
Swedish Foundation for Strategic Research (SSF), RIT15-0107; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20211109

Available from: 2021-02-22. Created: 2021-02-22. Last updated: 2022-06-25. Bibliographically approved
In thesis
1. Developing and evaluating co-speech gesture-synthesis models for embodied conversational agents
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

A large part of our communication is non-verbal: humans use non-verbal behaviors to express various aspects of their state or intent. Embodied artificial agents, such as virtual avatars or robots, should also use non-verbal behavior for efficient and pleasant interaction. A core part of non-verbal communication is gesticulation: gestures communicate a large share of non-verbal content. For example, around 90% of spoken utterances in descriptive discourse are accompanied by gestures. Since gestures are important, generating co-speech gestures has been an essential task in the Human-Agent Interaction (HAI) and Computer Graphics communities for several decades. Evaluating gesture-generation methods has been an equally important and equally challenging part of the field's development. Consequently, this thesis contributes to both the development and evaluation of gesture-generation models.

This thesis proposes three deep-learning-based gesture-generation models. The first model is deterministic, uses only audio, and generates only beat gestures. The second model is deterministic and uses both audio and text, aiming to generate meaningful gestures. The final model uses both audio and text and is probabilistic, in order to learn the stochastic character of human gesticulation. The methods have applications to both virtual agents and social robots. Individual research efforts in the field of gesture generation are difficult to compare, as there are no established benchmarks. To address this situation, my colleagues and I launched the first-ever gesture-generation challenge, which we called the GENEA Challenge. We have also investigated whether online participants are as attentive as offline participants and found that they are equally attentive provided that they are well paid. Finally, we developed a system that integrates co-speech gesture-generation models into a real-time interactive embodied conversational agent. This system is intended to facilitate the evaluation of modern gesture-generation models in interaction.

To further advance the development of capable gesture-generation methods, we need to advance their evaluation, and the research in the thesis supports the interpretation that evaluation is the main bottleneck limiting the field. There are currently no comprehensive co-speech gesture datasets that are large, high-quality, and diverse. In addition, no strong objective metrics are yet available. Creating speech-gesture datasets and developing objective metrics are highlighted as essential next steps for further development of the field.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2021. p. 47
Series
TRITA-EECS-AVL ; 2021:75
Keywords
Human-agent interaction, gesture generation, social robotics, conversational agents, non-verbal behavior, deep learning, machine learning
National category
Human-Computer Interaction (Interaction Design)
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-304618
ISBN: 978-91-8040-058-9
Public defence
2021-12-07, Sal Kollegiesalen, Stockholm, 13:00 (English)
Opponent
Supervisors
Funder
Swedish Foundation for Strategic Research (SSF), RIT15-0107
Note

QC 20211109

Available from: 2021-11-10. Created: 2021-11-08. Last updated: 2022-06-25. Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text; Scopus

Persons

Kucherenko, Taras; Henter, Gustav Eje; Kjellström, Hedvig
