A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020
Kucherenko, Taras: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-9838-8848
Jonell, Patrik: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-3687-6189
Yoon, Youngwoo: ETRI & KAIST, Daejeon, Republic of Korea. ORCID iD: 0000-0003-4286-3421
Wolfert, Pieter: IDLab, Ghent University – imec, Ghent, Belgium. ORCID iD: 0000-0002-7420-7181
Henter, Gustav Eje: KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent Systems, Speech, Music and Hearing, TMH
2021 (English). In: Proceedings IUI '21: 26th International Conference on Intelligent User Interfaces, Association for Computing Machinery (ACM), 2021, p. 11-21. Conference paper, Published paper (Refereed)
Abstract [en]

Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems are now solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another, in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.
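To make the evaluation design concrete, the following is a minimal Python sketch of how ratings collected in such a parallel, crowdsourced study might be aggregated: a per-system mean with a bootstrapped confidence interval. The data layout, system names, and scores below are illustrative assumptions, not the challenge's actual analysis code or data.

    # Sketch: aggregate crowdsourced ratings per system with bootstrap CIs.
    # All names and numbers here are hypothetical placeholders.
    import numpy as np

    def mean_with_bootstrap_ci(ratings, n_resamples=10_000, alpha=0.05, seed=0):
        """Return (mean, ci_low, ci_high) for a 1-D array of ratings."""
        rng = np.random.default_rng(seed)
        # Resample responses with replacement and recompute the mean each time.
        idx = rng.integers(0, len(ratings), size=(n_resamples, len(ratings)))
        resampled_means = ratings[idx].mean(axis=1)
        lo, hi = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
        return ratings.mean(), lo, hi

    # ratings[system] holds one score per study response (e.g. a 1-100 scale).
    ratings = {
        "natural_motion": np.array([72.0, 80, 65, 90, 77, 83]),
        "submitted_system_A": np.array([51.0, 60, 44, 58, 49, 55]),
    }
    for name, scores in ratings.items():
        mean, lo, hi = mean_with_bootstrap_ci(scores)
        print(f"{name}: mean {mean:.1f}, 95% CI [{lo:.1f}, {hi:.1f}]")

Because every system is rated on the same data and rendered with the same pipeline, differences between such intervals can be read as differences between motion-generation methods rather than between evaluation setups.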

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021. p. 11-21
Keywords [en]
gesture generation, conversational agents, evaluation paradigms
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
URN: urn:nbn:se:kth:diva-296490
DOI: 10.1145/3397481.3450692
ISI: 000747690200006
Scopus ID: 2-s2.0-85102546745
OAI: oai:DiVA.org:kth-296490
DiVA, id: diva2:1561006
Conference
IUI '21: 26th International Conference on Intelligent User Interfaces, College Station, TX, USA, April 13-17, 2021
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

Part of proceedings: ISBN 978-1-4503-8017-1

QC 20220303

Available from: 2021-06-05. Created: 2021-06-05. Last updated: 2022-06-25. Bibliographically approved.
In thesis
1. Developing and evaluating co-speech gesture-synthesis models for embodied conversational agents
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

A large part of our communication is non-verbal: humans use non-verbal behaviors to express various aspects of their state or intent. Embodied artificial agents, such as virtual avatars or robots, should likewise use non-verbal behavior for efficient and pleasant interaction. A core part of non-verbal communication is gesticulation: gestures communicate a large share of non-verbal content. For example, around 90% of spoken utterances in descriptive discourse are accompanied by gestures. Since gestures are important, generating co-speech gestures has been an essential task in the Human-Agent Interaction (HAI) and Computer Graphics communities for several decades. Evaluating gesture-generation methods has been an equally important, and equally challenging, part of the field's development. Consequently, this thesis contributes to both the development and the evaluation of gesture-generation models.

This thesis proposes three deep-learning-based gesture-generation models. The first model is deterministic, takes only audio as input, and generates only beat gestures. The second model is also deterministic but uses both audio and text, aiming to generate meaningful gestures. The final model likewise uses both audio and text, but is probabilistic, in order to capture the stochastic character of human gesticulation. The methods have applications to both virtual agents and social robots. Individual research efforts in the field of gesture generation are difficult to compare, as there are no established benchmarks. To address this situation, my colleagues and I launched the first-ever gesture-generation challenge, which we called the GENEA Challenge. We also investigated whether online participants are as attentive as offline participants, and found that the two groups are equally attentive, provided that they are well paid. Finally, we developed a system that integrates co-speech gesture-generation models into a real-time, interactive embodied conversational agent. This system is intended to facilitate the evaluation of modern gesture-generation models in interaction.
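As a rough illustration of the speech-to-motion mapping that the deterministic models above share, here is a minimal PyTorch sketch: a recurrent network mapping per-frame audio (and optionally text) features to per-frame pose vectors. The feature dimensions and architecture are assumptions chosen for brevity; this is not the thesis's published model code.

    # Sketch: a deterministic speech-to-gesture model. Dimensions are
    # illustrative (26 MFCC-like audio features, 768-d text embeddings,
    # 45-d pose vectors); the actual models in the thesis differ.
    import torch
    import torch.nn as nn

    class SpeechToGesture(nn.Module):
        def __init__(self, audio_dim=26, text_dim=768, pose_dim=45, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(audio_dim + text_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, pose_dim)  # one pose vector per frame

        def forward(self, audio, text):
            # audio: (batch, frames, audio_dim); text: (batch, frames, text_dim)
            x = torch.cat([audio, text], dim=-1)
            h, _ = self.rnn(x)
            return self.out(h)  # (batch, frames, pose_dim)

    model = SpeechToGesture()
    poses = model(torch.randn(2, 100, 26), torch.randn(2, 100, 768))
    print(poses.shape)  # torch.Size([2, 100, 45])

A probabilistic variant would instead parameterise a distribution over poses and sample from it, so that the same speech input can yield different, equally plausible gestures.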

To further advance the development of capable gesture-generation methods, we need to advance their evaluation, and the research in this thesis supports the interpretation that evaluation is the main bottleneck limiting the field. There is currently no comprehensive co-speech gesture dataset that is simultaneously large, high-quality, and diverse, and no strong objective metrics are yet available. Creating such speech-gesture datasets and developing objective metrics are highlighted as essential next steps for the further development of the field.
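For concreteness, below is a small sketch of one simple kinematic statistic sometimes used as an objective descriptor when evaluating generated gestures: average jerk, the third time-derivative of position. It only exemplifies the kind of objective measurement available today; the array shapes and frame rate are illustrative assumptions, and the thesis's point stands that no such metric is yet a strong substitute for human evaluation.

    # Sketch: average jerk of a motion clip, a simple objective descriptor.
    import numpy as np

    def average_jerk(positions, fps=20.0):
        """positions: (frames, joints, 3) joint positions in metres."""
        dt = 1.0 / fps
        jerk = np.diff(positions, n=3, axis=0) / dt**3  # finite differences
        return np.linalg.norm(jerk, axis=-1).mean()     # mean magnitude, m/s^3

    rng = np.random.default_rng(0)
    motion = np.cumsum(rng.normal(0.0, 0.01, size=(100, 15, 3)), axis=0)
    print(f"average jerk: {average_jerk(motion):.2f} m/s^3")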

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2021. p. 47
Series
TRITA-EECS-AVL ; 2021:75
Keywords
Human-agent interaction, gesture generation, social robotics, conversational agents, non-verbal behavior, deep learning, machine learning
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-304618
ISBN: 978-91-8040-058-9
Public defence
2021-12-07, Sal Kollegiesalen, Stockholm, 13:00 (English)
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20211109

Available from: 2021-11-10. Created: 2021-11-08. Last updated: 2022-06-25. Bibliographically approved.

Open Access in DiVA

fulltext (920 kB, application/pdf)

Other links

Publisher's full text
Scopus
Open Access Paper

Authority records

Kucherenko, Taras; Jonell, Patrik; Henter, Gustav Eje
