Publications (4 of 4)
Kucherenko, T., Hasegawa, D., Henter, G. E., Kaneko, N. & Kjellström, H. (2019). Analyzing Input and Output Representations for Speech-Driven Gesture Generation. In: 19th ACM International Conference on Intelligent Virtual Agents. Paper presented at 19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France. New York, NY, USA: ACM Publications
Analyzing Input and Output Representations for Speech-Driven Gesture Generation
2019 (English). In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
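
To make the two-step pipeline above concrete, here is a minimal sketch assuming plain feed-forward networks and illustrative dimensionalities; it is not the paper's actual implementation of MotionE, MotionD and SpeechE, and the training loops are only indicated via the loss terms.

```python
# Minimal sketch, not the authors' implementation: a denoising autoencoder
# (MotionE/MotionD) learns a compact motion representation, and SpeechE maps
# speech features (e.g. MFCCs) to that representation. All sizes are guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F

POSE_DIM, SPEECH_DIM, REPR_DIM = 192, 26, 45   # assumed dimensionalities

def mlp(d_in, d_out, hidden=128):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                         nn.Linear(hidden, d_out))

motion_e = mlp(POSE_DIM, REPR_DIM)    # MotionE: pose -> representation
motion_d = mlp(REPR_DIM, POSE_DIM)    # MotionD: representation -> pose
speech_e = mlp(SPEECH_DIM, REPR_DIM)  # SpeechE: speech features -> representation

# Step 1 (denoising autoencoder): reconstruct clean poses from noisy inputs.
poses = torch.randn(64, POSE_DIM)                       # dummy pose frames
noisy = poses + 0.1 * torch.randn_like(poses)
recon_loss = F.mse_loss(motion_d(motion_e(noisy)), poses)

# Step 2: with MotionE fixed, train SpeechE to predict MotionE(pose)
# from time-aligned speech features.
speech = torch.randn(64, SPEECH_DIM)                    # dummy speech frames
with torch.no_grad():
    target_repr = motion_e(poses)
repr_loss = F.mse_loss(speech_e(speech), target_repr)

# Test time: chain the two trained parts to go from speech to 3D poses.
predicted_poses = motion_d(speech_e(speech))            # (64, POSE_DIM)
```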

Place, publisher, year, edition, pages
New York, NY, USA: ACM Publications, 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, gesture synthesis, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-255035 (URN), 10.1145/3308532.3329472 (DOI), 2-s2.0-85069654899 (Scopus ID), 978-1-4503-6672-4 (ISBN)
Conference
19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190902

Available from: 2019-07-16. Created: 2019-07-16. Last updated: 2019-09-02. Bibliographically approved
Kucherenko, T., Hasegawa, D., Kaneko, N., Henter, G. E. & Kjellström, H. (2019). On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract. Paper presented at International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada (pp. 2072-2074). The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract
2019 (English). Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces gestures in the form of sequences of 3D joint coordinates representing motion as output. The results of objective and subjective evaluations confirm the benefits of the representation learning.
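
As a hedged illustration of the kind of per-frame speech features such a model might take as input (the exact feature configuration, frame rate and toolchain used in the paper are assumptions here), the snippet below computes MFCCs plus a simple F0-based prosodic feature with librosa and stacks them into a frame-by-feature matrix.

```python
# Illustrative only: one plausible way to compute per-frame speech features
# (MFCCs plus an F0-based prosodic feature) with librosa. The paper's exact
# feature set and frame rate are not reproduced here.
import numpy as np
import librosa

sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = 0.1 * np.sin(2 * np.pi * 220 * t).astype(np.float32)  # stand-in for real speech

hop = 512                                                  # ~32 ms hop at 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26, hop_length=hop)      # (26, T)
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=300,
                                            sr=sr, hop_length=hop)
f0 = np.nan_to_num(f0)                                     # unvoiced frames -> 0 Hz

T = min(mfcc.shape[1], len(f0))                            # guard against off-by-one frames
features = np.vstack([mfcc[:, :T], f0[None, :T]]).T        # (T, 27) frame-by-feature
print(features.shape)
```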

Place, publisher, year, edition, pages
The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-251648 (URN)
Conference
International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190515

Available from: 2019-05-16. Created: 2019-05-16. Last updated: 2019-05-22. Bibliographically approved
Wolfert, P., Kucherenko, T., Kjellström, H. & Belpaeme, T. (2019). Should Beat Gestures Be Learned Or Designed?: A Benchmarking User Study. In: ICDL-EPIROB 2019: Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. Paper presented at ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. IEEE conference proceedings
Should Beat Gestures Be Learned Or Designed?: A Benchmarking User Study
2019 (English). In: ICDL-EPIROB 2019: Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, IEEE conference proceedings, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a user study on generated beat gestures for humanoid agents. It has been shown that Human-Robot Interaction can be improved by including communicative non-verbal behavior, such as arm gestures. Beat gestures are one of the four types of arm gestures and are known to be used for emphasizing parts of speech. In our user study, we compare beat gestures learned from training data with hand-crafted beat gestures. The first kind of gestures is generated by a machine learning model trained on speech audio and human upper-body poses. We compared this approach with three hand-coded beat gesture methods: designed beat gestures, timed beat gestures, and noisy gestures. Forty-one subjects participated in our user study, and a ranking was derived from paired comparisons using the Bradley-Terry-Luce model. We found that, for beat gestures, the gestures from the machine learning model are preferred, followed by algorithmically generated gestures. This emphasizes the promise of machine learning for generating communicative actions.
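
For readers unfamiliar with the ranking step, the sketch below fits a Bradley-Terry-Luce model to paired-comparison counts using the standard MM updates; the four conditions mirror those named in the abstract, but the win counts are made up and are not the study's data.

```python
# Hedged sketch: deriving a ranking from paired-comparison counts with the
# Bradley-Terry-Luce model, fitted by the usual MM updates (Hunter, 2004).
# The win matrix below is hypothetical, not the study's results.
import numpy as np

def bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of times condition i was preferred over condition j."""
    n = wins.shape[0]
    comparisons = wins + wins.T              # total comparisons per pair
    strength = np.ones(n)                    # BTL "worth" parameters
    for _ in range(n_iter):
        for i in range(n):
            denom = sum(comparisons[i, j] / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            strength[i] = wins[i].sum() / denom
        strength /= strength.sum()           # fix the arbitrary scale
    return strength

# Hypothetical counts for four conditions:
# learned, designed, timed, and noisy beat gestures.
wins = np.array([[0, 30, 32, 35],
                 [25, 0, 28, 33],
                 [23, 27, 0, 30],
                 [20, 22, 25, 0]])
worth = bradley_terry(wins)
ranking = np.argsort(-worth)                 # indices from most to least preferred
print(worth, ranking)
```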

Place, publisher, year, edition, pages
IEEE conference proceedings, 2019
Keywords
gesture generation, machine learning, beat gestures, user study, virtual agents
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-255998 (URN)
Conference
ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions
Note

QC 20190815

Available from: 2019-08-14. Created: 2019-08-14. Last updated: 2019-08-15. Bibliographically approved
Kucherenko, T. (2018). Data Driven Non-Verbal Behavior Generation for Humanoid Robots. Paper presented at 2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA (pp. 520-523). Boulder, CO, USA: ACM Digital Library
Data Driven Non-Verbal Behavior Generation for Humanoid Robots
2018 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Social robots need non-verbal behavior to make an interaction pleasant and efficient. Most models for generating non-verbal behavior are rule-based and hence can produce only a limited set of motions and are tuned to a particular scenario. In contrast, data-driven systems are flexible and easily adjustable. Hence, we aim to learn a data-driven model for generating non-verbal behavior (in the form of a 3D motion sequence) for humanoid robots. Our approach is based on a popular and powerful deep generative model: the Variational Autoencoder (VAE). The input to our model will be multimodal, and we will iteratively increase its complexity: first it will use only the speech signal, then also the text transcription, and finally the non-verbal behavior of the conversation partner. We will evaluate our system on virtual avatars as well as on two humanoid robots with different embodiments: NAO and Furhat. Our model will be easy to adapt to a novel domain: this can be done by providing application-specific training data.
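
For orientation only, here is a generic textbook Variational Autoencoder over single pose frames; the proposal's actual model is multimodal and speech-conditioned, which this sketch does not attempt, and the pose and latent dimensions are assumed.

```python
# Minimal VAE sketch over single pose frames. This is a generic VAE, not the
# proposal's model (which the abstract only outlines); POSE_DIM and LATENT_DIM
# are illustrative guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F

POSE_DIM, LATENT_DIM = 192, 32            # assumed sizes

class PoseVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(POSE_DIM, 128)
        self.mu = nn.Linear(128, LATENT_DIM)
        self.logvar = nn.Linear(128, LATENT_DIM)
        self.dec = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, POSE_DIM))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = F.mse_loss(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

model = PoseVAE()
poses = torch.randn(16, POSE_DIM)          # dummy batch of pose frames
recon, mu, logvar = model(poses)
loss = vae_loss(recon, poses, mu, logvar)
```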

Place, publisher, year, edition, pages
Boulder, CO, USA: ACM Digital Library, 2018
Keywords
Non-verbal behavior, data driven systems, machine learning, deep learning, humanoid robot
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-238617 (URN), 10.1145/3242969.3264970 (DOI), 000457913100073 (), 2-s2.0-85056642092 (Scopus ID), 978-1-4503-5692-3 (ISBN)
Conference
2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA
Projects
EACare
Funder
Swedish Foundation for Strategic Research, 7085
Note

QC 20181106

Available from: 2018-11-05. Created: 2018-11-05. Last updated: 2019-03-18. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0001-9838-8848
