Developing and evaluating co-speech gesture-synthesis models for embodied conversational agents
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL. ORCID iD: 0000-0001-9838-8848
2021 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

A large part of our communication is non-verbal: humans use non-verbal behaviors to express various aspects of their state or intent. Embodied artificial agents, such as virtual avatars or robots, should also use non-verbal behavior for efficient and pleasant interaction. A core part of non-verbal communication is gesticulation: gestures communicate a large share of non-verbal content. For example, around 90% of spoken utterances in descriptive discourse are accompanied by gestures. Since gestures are important, generating co-speech gestures has been an essential task in the Human-Agent Interaction (HAI) and Computer Graphics communities for several decades. Evaluating gesture-generation methods has been an equally important and equally challenging part of the field's development. Consequently, this thesis contributes to both the development and evaluation of gesture-generation models.

This thesis proposes three deep-learning-based gesture-generation models. The first model is deterministic, uses only audio, and generates only beat gestures. The second model is deterministic and uses both audio and text, aiming to generate meaningful gestures. The third model uses both audio and text and is probabilistic, so that it can learn the stochastic character of human gesticulation. The methods have applications to both virtual agents and social robots. Individual research efforts in the field of gesture generation are difficult to compare, as there are no established benchmarks. To address this situation, my colleagues and I launched the first-ever gesture-generation challenge, which we called the GENEA Challenge. We have also investigated whether online participants are as attentive as offline participants and found that they are equally attentive, provided that they are well paid. Finally, we developed a system that integrates co-speech gesture-generation models into a real-time interactive embodied conversational agent. This system is intended to facilitate the evaluation of modern gesture-generation models in interaction.

To further advance the development of capable gesture-generation methods, we need to advance their evaluation, and the research in the thesis supports the interpretation that evaluation is the main bottleneck limiting the field. There are currently no comprehensive co-speech gesture datasets that are large, high-quality, and diverse. In addition, no strong objective metrics are yet available. Creating speech-gesture datasets and developing objective metrics are highlighted as essential next steps for further field development.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2021, p. 47
Series
TRITA-EECS-AVL ; 2021:75
Keywords [en]
Human-agent interaction, gesture generation, social robotics, conversational agents, non-verbal behavior, deep learning, machine learning
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-304618
ISBN: 978-91-8040-058-9 (print)
OAI: oai:DiVA.org:kth-304618
DiVA, id: diva2:1609615
Public defence
2021-12-07, Sal Kollegiesalen, Stockholm, 13:00 (English)
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20211109

Available from: 2021-11-10 Created: 2021-11-08 Last updated: 2022-06-25. Bibliographically approved
List of papers
1. Moving Fast and Slow: Analysis of Representations and Post-Processing in Speech-Driven Automatic Gesture Generation
2021 (English). In: International Journal of Human-Computer Interaction, ISSN 1044-7318, E-ISSN 1532-7590, Vol. 37, no 14, p. 1300-1316. Article in journal (Refereed), Published
Abstract [en]

This paper presents a novel framework for speech-driven gesture production, applicable to virtual agents to enhance human-computer interaction. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. We provide an analysis of different representations for the input (speech) and the output (motion) of the network by both objective and subjective evaluations. We also analyze the importance of smoothing of the produced motion. Our results indicated that the proposed method improved on our baseline in terms of objective measures. For example, it better captured the motion dynamics and better matched the motion-speed distribution. Moreover, we performed user studies on two different datasets. The studies confirmed that our proposed method is perceived as more natural than the baseline, although the difference in the studies was eliminated by appropriate post-processing: hip-centering and smoothing. We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
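The two post-processing steps named in this abstract, hip-centering and smoothing, can be illustrated with a minimal sketch. This is not the paper's implementation: the frame layout (a list of joints, each an [x, y, z] triple), the hip joint index, and the choice of a moving-average filter are all assumptions made for illustration.

```python
from typing import List

# Assumed layout: motion[t][j] is the [x, y, z] position of joint j at frame t.
Frame = List[List[float]]


def hip_center(motion: List[Frame], hip_index: int = 0) -> List[Frame]:
    """Translate every frame so the (assumed) hip joint sits at the origin."""
    centered = []
    for frame in motion:
        hx, hy, hz = frame[hip_index]
        centered.append([[x - hx, y - hy, z - hz] for x, y, z in frame])
    return centered


def moving_average(motion: List[Frame], window: int = 3) -> List[Frame]:
    """Smooth each coordinate over time with a centered moving average.

    The window is truncated at the sequence boundaries. The filter type is
    an assumption; the abstract does not say which smoothing was used.
    """
    half = window // 2
    n_frames = len(motion)
    smoothed = []
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        count = hi - lo
        frame = []
        for j in range(len(motion[t])):
            frame.append(
                [sum(motion[k][j][c] for k in range(lo, hi)) / count for c in range(3)]
            )
        smoothed.append(frame)
    return smoothed


# Toy sequence: 3 frames, 2 joints (joint 0 plays the role of the hip).
toy_motion = [
    [[1, 1, 1], [2, 2, 2]],
    [[2, 2, 2], [4, 2, 2]],
    [[3, 3, 3], [3, 3, 6]],
]
centered_motion = hip_center(toy_motion)
smoothed_motion = moving_average(centered_motion, window=3)
```

Centering first removes global translation, so the smoothing filter only acts on the motion of the limbs relative to the body.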

Place, publisher, year, edition, pages
Informa UK Limited, 2021
Keywords
Gesture generation, representation learning, neural network, deep learning, virtual agents, non-verbal behavior
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-290787 (URN)
10.1080/10447318.2021.1883883 (DOI)
000619086000001 ()
2-s2.0-85100955521 (Scopus ID)
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20211109

Available from: 2021-02-22 Created: 2021-02-22 Last updated: 2022-06-25. Bibliographically approved
2. Gesticulator: A framework for semantically-aware speech-driven gesture generation
2020 (English). In: ICMI '20: Proceedings of the 2020 International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2020. Conference paper, Published paper (Refereed)
Abstract [en]

During speech, people spontaneously gesticulate, which plays a key role in conveying information. Similarly, realistic co-speech gestures are crucial to enable natural and smooth interactions with social agents. Current end-to-end co-speech gesture generation systems use a single modality for representing speech: either audio or text. These systems are therefore confined to producing either acoustically-linked beat gestures or semantically-linked gesticulation (e.g., raising a hand when saying “high”): they cannot appropriately learn to generate both gesture types. We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots. Subjective and objective evaluations confirm the success of our approach. The code and video are available at the project page svito-zar.github.io/gesticula

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Keywords
Gesture generation; virtual agents; socially intelligent systems; co-speech gestures; multi-modal interaction; deep learning
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-286282 (URN)
10.1145/3382507.3418815 (DOI)
001437992100029 ()
2-s2.0-85096710861 (Scopus ID)
Conference
ICMI '20: INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION Virtual Event Netherlands October 25 - 29, 2020
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

ICMI 2020 Best Paper Award

Part of Proceedings: ISBN 978-1-4503-7581-8

QC 20211109

Available from: 2020-11-24 Created: 2020-11-24 Last updated: 2025-12-05. Bibliographically approved
3. Speech2Properties2Gestures: Gesture-Property Prediction as a Tool for Generating Representational Gestures from Speech
2021 (English). In: IVA '21: Proceedings of the 21st ACM International Conference on Intelligent Virtual Agents, Association for Computing Machinery (ACM), 2021, p. 145-147. Conference paper, Published paper (Refereed)
Abstract [en]

We propose a new framework for gesture generation, aiming to allow data-driven approaches to produce more semantically rich gestures. Our approach first predicts whether to gesture, followed by a prediction of the gesture properties. Those properties are then used as conditioning for a modern probabilistic gesture-generation model capable of high-quality output. This empowers the approach to generate gestures that are both diverse and representational. Follow-ups and more information can be found on the project page: https://svito-zar.github.io/speech2properties2gestures
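The control flow described in this abstract (decide whether to gesture, then predict gesture properties, then pass them on as conditioning) can be sketched as follows. Everything concrete here is a hypothetical stand-in: the feature format, the threshold rule, and the property name and values are invented for illustration and are not taken from the paper, where both predictors are learned models.

```python
from typing import Callable, Dict, List, Optional

Features = List[float]       # assumed per-frame speech features
Properties = Dict[str, str]  # assumed gesture-property labels


def build_conditioning(
    frames: List[Features],
    should_gesture: Callable[[Features], bool],
    predict_properties: Callable[[Features], Properties],
) -> List[Optional[Properties]]:
    """Two-stage prediction: gate first, then predict properties.

    Frames where no gesture is predicted get None, so a downstream
    generator could tell "no gesture" apart from "gesture with properties".
    """
    conditioning: List[Optional[Properties]] = []
    for features in frames:
        if should_gesture(features):
            conditioning.append(predict_properties(features))
        else:
            conditioning.append(None)
    return conditioning


# Hypothetical toy predictors standing in for trained models.
def toy_should_gesture(features: Features) -> bool:
    return sum(features) > 1.0


def toy_predict_properties(features: Features) -> Properties:
    return {"phase": "stroke" if features[0] > 0.5 else "hold"}


toy_frames = [[0.2, 0.3], [0.9, 0.4], [0.3, 0.9]]
toy_conditioning = build_conditioning(
    toy_frames, toy_should_gesture, toy_predict_properties
)
```

Keeping the gate separate from the property predictor means the second stage only has to describe gestures that are actually going to be produced.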

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
gesture generation, virtual agents, representational gestures
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-302667 (URN)
10.1145/3472306.3478333 (DOI)
000728149900023 ()
2-s2.0-85113524837 (Scopus ID)
Conference
21st ACM International Conference on Intelligent Virtual Agents, IVA 2021, Virtual, Online, 14 September 2021 through 17 September 2021, University of Fukuchiyama, Fukuchiyama City, Kyoto, Japan
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20211102

Part of Proceedings: ISBN 9781450386197

Available from: 2021-09-28 Created: 2021-09-28 Last updated: 2022-06-25. Bibliographically approved
4. Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents
2020 (English). In: IVA '20: Proceedings of the 20th ACM International Conference on Intelligent Virtual Agents, Association for Computing Machinery (ACM), 2020. Conference paper, Published paper (Refereed)
Abstract [en]

Conducting user studies is a crucial component in many scientific fields. While some studies require participants to be physically present, other studies can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, not to mention that sometimes research groups might not be able to run in-lab experiments, because of, for example, a pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or Prolific can therefore be a suitable alternative to run certain experiments, such as evaluating virtual agents. Although previous studies investigated the use of crowdsourcing platforms for running experiments, there is still uncertainty as to whether the results are reliable for perceptual studies. Here we replicate a previous experiment where participants evaluated a gesture generation model for virtual agents. The experiment is conducted across three participant pools (in-lab, Prolific, and AMT), having similar demographics across the in-lab participants and the Prolific platform. Our results show no difference between the three participant pools with regard to their evaluations of the gesture generation models and their reliability scores. The results indicate that online platforms can successfully be used for perceptual evaluations of this kind.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2020
Keywords
user studies, online participants, attentiveness
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-290562 (URN)
10.1145/3383652.3423860 (DOI)
000728153600002 ()
2-s2.0-85096979963 (Scopus ID)
Conference
IVA '20: ACM International Conference on Intelligent Virtual Agents, Virtual Event, Scotland, UK, October 20-22, 2020
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Wallenberg AI, Autonomous Systems and Software Program (WASP), CorSA
Note

QC 20211109

Part of Proceedings: ISBN 978-145037586-3

Taras Kucherenko and Patrik Jonell contributed equally to this research.

Available from: 2021-02-18 Created: 2021-02-18 Last updated: 2022-06-25. Bibliographically approved
5. A large, crowdsourced evaluation of gesture generation systems on common data: The GENEA Challenge 2020
2021 (English). In: Proceedings IUI '21: 26th International Conference on Intelligent User Interfaces, Association for Computing Machinery (ACM), 2021, p. 11-21. Conference paper, Published paper (Refereed)
Abstract [en]

Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems now are solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2021
Keywords
gesture generation, conversational agents, evaluation paradigms
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-296490 (URN)
10.1145/3397481.3450692 (DOI)
000747690200006 ()
2-s2.0-85102546745 (Scopus ID)
Conference
IUI '21: 26th International Conference on Intelligent User Interfaces, College Station, TX, USA, April 13-17, 2021
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

Part of Proceedings: ISBN 978-145038017-1

QC 20220303

Available from: 2021-06-05 Created: 2021-06-05 Last updated: 2022-06-25. Bibliographically approved
6. A Framework for Integrating Gesture Generation Models into Interactive Conversational Agents
2021 (English). Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Embodied conversational agents (ECAs) benefit from non-verbal behavior for natural and efficient interaction with users. Gesticulation, the hand and arm movements accompanying speech, is an essential part of non-verbal behavior. Gesture generation models have been developed for several decades: starting with rule-based and ending with mainly data-driven methods. To date, recent end-to-end gesture generation methods have not been evaluated in a real-time interaction with users. We present a proof-of-concept framework, which is intended to facilitate evaluation of modern gesture generation models in interaction. We demonstrate an extensible open-source framework that contains three components: 1) a 3D interactive agent; 2) a chatbot back-end; 3) a gesticulating system. Each component can be replaced, making the proposed framework applicable for investigating the effect of different gesturing models in real-time interactions with different communication modalities, chatbot backends, or different agent appearances. The code and video are available at the project page https://nagyrajmund.github.io/project/gesturebot.
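The idea of three replaceable components can be illustrated with a small interface sketch. The class and method names below are hypothetical and are not taken from the project's code; the toy implementations only show how one component could be swapped without touching the other two.

```python
from typing import List, Protocol


class Chatbot(Protocol):
    """Hypothetical interface for the chatbot back-end."""
    def reply(self, user_text: str) -> str: ...


class GestureModel(Protocol):
    """Hypothetical interface for the gesticulating system."""
    def generate(self, speech_text: str) -> List[str]: ...


class Agent(Protocol):
    """Hypothetical interface for the interactive agent."""
    def perform(self, speech_text: str, gestures: List[str]) -> str: ...


class EchoChatbot:
    def reply(self, user_text: str) -> str:
        return f"You said: {user_text}"


class NodPerWordGestures:
    def generate(self, speech_text: str) -> List[str]:
        # Toy rule: one "nod" per spoken word.
        return ["nod"] * len(speech_text.split())


class NoGestures:
    def generate(self, speech_text: str) -> List[str]:
        return []


class TextAgent:
    def perform(self, speech_text: str, gestures: List[str]) -> str:
        # Stand-in for rendering: describe the behavior as a string.
        return f"{speech_text} [{', '.join(gestures)}]"


def respond(user_text: str, chatbot: Chatbot, gesture_model: GestureModel, agent: Agent) -> str:
    """Run one interaction turn through the three components in order."""
    speech = chatbot.reply(user_text)
    gestures = gesture_model.generate(speech)
    return agent.perform(speech, gestures)


with_nods = respond("hi there", EchoChatbot(), NodPerWordGestures(), TextAgent())
without_gestures = respond("hi there", EchoChatbot(), NoGestures(), TextAgent())
```

Because `respond` depends only on the three interfaces, comparing two gesturing models amounts to passing a different `gesture_model` argument while the chatbot and agent stay fixed.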

Keywords
conversational embodied agents; non-verbal behavior synthesis
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-304616 (URN)
Conference
20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20211130

Not duplicate with DiVA 1653872

Available from: 2021-11-08 Created: 2021-11-08 Last updated: 2022-06-25. Bibliographically approved

Open Access in DiVA

fulltext (1397 kB), 879 downloads
File information
File name: FULLTEXT01.pdf
File size: 1397 kB
Checksum SHA-512: c2dc0536f5da3abe620c2c731d43d454a74f4c8ecf9feaf93734ac227018331786e20462cf2a2623b2047b8fd676098ec2f94d93a4efabff95557109eb6d6ba5
Type: fulltext
Mimetype: application/pdf

Authority records

Kucherenko, Taras
