Publications (7 of 7)
Alexanderson, S., Henter, G. E., Kucherenko, T. & Beskow, J. (2020). Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. Paper presented at EUROGRAPHICS 2020.
2020 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model for generating state-of-the-art realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just as in human gesturing, this gives a rich natural variation of motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and match the input speech well. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
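The model described above is built on normalising flows. As a rough, illustrative sketch (not the authors' implementation; the class name, dimensions, and use of PyTorch are assumptions), the core building block of such a flow is an affine coupling layer whose scale and shift are predicted from the other half of the pose plus the speech/style conditioning:

```python
# Hedged sketch of a speech/style-conditioned affine coupling layer,
# the kind of invertible block used in flow-based models such as MoGlow.
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    """Transforms one half of the pose vector with a scale/shift predicted
    from the other half plus the speech/style context."""

    def __init__(self, pose_dim: int, context_dim: int, hidden: int = 256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + context_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),  # log-scale and shift
        )

    def forward(self, x, context):
        x_a, x_b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x_a, context], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)            # keep scales numerically well-behaved
        z_b = x_b * torch.exp(log_s) + t     # invertible affine transform
        log_det = log_s.sum(dim=-1)          # contribution to the exact log-likelihood
        return torch.cat([x_a, z_b], dim=-1), log_det

    def inverse(self, z, context):
        z_a, z_b = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z_a, context], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x_b = (z_b - t) * torch.exp(-log_s)
        return torch.cat([z_a, x_b], dim=-1)
```

Stacking such invertible blocks gives an exact likelihood for training, and sampling different latent vectors for the same speech input yields the different-but-plausible gestures mentioned above; directorial style control enters through the conditioning vector.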

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-268363 (URN)
Conference
EUROGRAPHICS 2020
Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-02-18
Kucherenko, T., Hasegawa, D., Henter, G. E., Kaneko, N. & Kjellström, H. (2019). Analyzing Input and Output Representations for Speech-Driven Gesture Generation. In: 19th ACM International Conference on Intelligent Virtual Agents. Paper presented at the 19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France. New York, NY, USA: ACM Publications
2019 (English) In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
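To make the two-step structure concrete, here is a minimal sketch under assumed dimensions and plain fully connected PyTorch layers (the paper's actual architectures, feature sizes, and training code are not reproduced here): MotionE/MotionD are trained as a denoising autoencoder, SpeechE is then trained to predict MotionE's representations, and at test time SpeechE and MotionD are chained.

```python
# Illustrative sketch only; layer sizes and feature dimensions are guesses,
# not the paper's hyperparameters.
import torch
import torch.nn as nn

class MotionE(nn.Module):
    """Motion encoder: full pose -> lower-dimensional representation."""
    def __init__(self, pose_dim=192, repr_dim=40):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(pose_dim, 256), nn.ReLU(),
                                 nn.Linear(256, repr_dim))
    def forward(self, pose):
        return self.net(pose)

class MotionD(nn.Module):
    """Motion decoder: representation -> full pose."""
    def __init__(self, pose_dim=192, repr_dim=40):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(repr_dim, 256), nn.ReLU(),
                                 nn.Linear(256, pose_dim))
    def forward(self, z):
        return self.net(z)

class SpeechE(nn.Module):
    """Speech encoder: speech features (e.g. MFCCs) -> motion representation."""
    def __init__(self, speech_dim=26, repr_dim=40):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, 256), nn.ReLU(),
                                 nn.Linear(256, repr_dim))
    def forward(self, speech):
        return self.net(speech)

# Step 1: train MotionE + MotionD as a denoising autoencoder on poses.
# Step 2: train SpeechE to regress MotionE's representations from speech.
# Test time: chain the speech encoder and the motion decoder.
def synthesize(speech_frames, speech_e: SpeechE, motion_d: MotionD):
    with torch.no_grad():
        return motion_d(speech_e(speech_frames))   # predicted pose sequence
```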

Place, publisher, year, edition, pages
New York, NY, USA: ACM Publications, 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, gesture synthesis, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-255035 (URN), 10.1145/3308532.3329472 (DOI), 2-s2.0-85069654899 (Scopus ID), 978-1-4503-6672-4 (ISBN)
Conference
19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190902

Available from: 2019-07-16 Created: 2019-07-16 Last updated: 2019-09-02. Bibliographically approved
Jonell, P., Kucherenko, T., Ekstedt, E. & Beskow, J. (2019). Learning Non-verbal Behavior for a Social Robot from YouTube Videos. Paper presented at the ICDL-EpiRob Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, Oslo, Norway, August 19, 2019.
2019 (English) Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Non-verbal behavior is crucial for positive perception of humanoid robots. If modeled well, it can improve the interaction and leave the user with a positive experience; if modeled poorly, it may impede the interaction and become a source of distraction. Most existing work on modeling non-verbal behavior shows limited variability, because the models employed are deterministic and the generated motion can be perceived as repetitive and predictable. In this paper, we present a novel method for generating a limited set of facial expressions and head movements, based on a probabilistic generative deep learning architecture called Glow. We have implemented a workflow which takes videos directly from YouTube, extracts relevant features, and trains a model that generates gestures that can be realized in a robot without any post-processing. A user study was conducted and illustrated the importance of having some kind of non-verbal behavior; most differences between the ground truth, the proposed method, and a random control were not significant (however, the differences that were significant were in favor of the proposed method).
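The method above builds on Glow. As a hedged illustration of one of Glow's characteristic components (a generic sketch, not the authors' code; the tensor layout and use of PyTorch are assumptions), an invertible 1x1 convolution acts as a learned channel mixing whose log-determinant enters the training likelihood:

```python
# Generic, illustrative sketch of Glow's invertible 1x1 convolution;
# not taken from the paper's implementation.
import torch
import torch.nn as nn

class Invertible1x1Conv(nn.Module):
    """Learned, invertible mixing of feature channels; its log-determinant
    contributes to the likelihood the flow is trained to maximise."""
    def __init__(self, channels: int):
        super().__init__()
        # start from a random rotation so the transform is invertible at init
        w, _ = torch.linalg.qr(torch.randn(channels, channels))
        self.weight = nn.Parameter(w)

    def forward(self, x):                          # x: (batch, time, channels)
        z = x @ self.weight
        log_det = x.shape[1] * torch.slogdet(self.weight)[1]
        return z, log_det

    def inverse(self, z):
        return z @ torch.inverse(self.weight)
```

Because such a flow defines a proper probability distribution, sampling different latents for the same input produces varied rather than repetitive behavior, which is the point of contrast with deterministic models made in the abstract.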

Keywords
Facial expressions, non-verbal behavior, generative models, neural network, head movement, social robotics
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-261242 (URN)
Conference
ICDL-EpiRob Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, Oslo, Norway, August 19, 2019
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20191007

Available from: 2019-10-03 Created: 2019-10-03 Last updated: 2019-10-07. Bibliographically approved
Kucherenko, T., Hasegawa, D., Kaneko, N., Henter, G. E. & Kjellström, H. (2019). On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract. Paper presented at the International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada (pp. 2072-2074). The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
2019 (English) Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces gestures in the form of sequences of 3D joint coordinates representing motion as output. The results of objective and subjective evaluations confirm the benefits of the representation learning.

Place, publisher, year, edition, pages
The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-251648 (URN), 000474345000309 ()
Conference
International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190515

Available from: 2019-05-16 Created: 2019-05-16 Last updated: 2019-10-25. Bibliographically approved
Wolfert, P., Kucherenko, T., Kjellström, H. & Belpaeme, T. (2019). Should Beat Gestures Be Learned Or Designed?: A Benchmarking User Study. In: ICDL-EPIROB 2019: Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. Paper presented at ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions. IEEE conference proceedings
2019 (English) In: ICDL-EPIROB 2019: Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions, IEEE conference proceedings, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we present a user study on generated beat gestures for humanoid agents. It has been shown that human-robot interaction can be improved by including communicative non-verbal behavior, such as arm gestures. Beat gestures are one of the four types of arm gestures, and are known to be used for emphasizing parts of speech. In our user study, we compare beat gestures learned from training data with hand-crafted beat gestures. The first kind of gestures are generated by a machine learning model trained on speech audio and human upper-body poses. We compared this approach with three hand-coded beat gesture methods: designed beat gestures, timed beat gestures, and noisy gestures. Forty-one subjects participated in our user study, and a ranking was derived from paired comparisons using the Bradley-Terry-Luce model. We found that for beat gestures, the gestures from the machine learning model are preferred, followed by the algorithmically generated gestures. This emphasizes the promise of machine learning for generating communicative actions.
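As an illustration of the ranking step, Bradley-Terry-Luce strengths can be fitted to paired-comparison counts with the classic MM (Zermelo) iteration; the sketch below uses made-up win counts and an arbitrary condition order, not the study's actual data.

```python
# Hedged sketch: fit Bradley-Terry-Luce strengths from paired comparisons.
# The win matrix below is invented for illustration only.
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = how often condition i was preferred over condition j."""
    n = wins.shape[0]
    p = np.ones(n)                        # latent "strengths", one per condition
    comparisons = wins + wins.T           # total i-vs-j comparison counts
    for _ in range(iters):                # classic MM / Zermelo updates
        for i in range(n):
            denom = sum(comparisons[i, j] / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()                      # normalise (only ratios are identifiable)
    return p

# Hypothetical counts for four conditions,
# e.g. 0: learned, 1: designed, 2: timed, 3: noisy.
wins = np.array([[ 0, 25, 28, 35],
                 [15,  0, 24, 30],
                 [12, 16,  0, 27],
                 [ 5, 10, 13,  0]])
strengths = bradley_terry(wins)
ranking = np.argsort(-strengths)          # most to least preferred condition
```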

Place, publisher, year, edition, pages
IEEE conference proceedings, 2019
Keywords
gesture generation, machine learning, beat gestures, user study, virtual agents
National Category
Human Computer Interaction
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-255998 (URN)
Conference
ICDL-EPIROB 2019 Workshop on Naturalistic Non-Verbal and Affective Human-Robot Interactions
Note

QC 20190815

Available from: 2019-08-14 Created: 2019-08-14 Last updated: 2019-08-15. Bibliographically approved
Kucherenko, T. (2018). Data Driven Non-Verbal Behavior Generation for Humanoid Robots. Paper presented at the 2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA (pp. 520-523). Boulder, CO, USA: ACM Digital Library
2018 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Social robots need non-verbal behavior to make an interaction pleasant and efficient. Most of the models for generating non-verbal behavior are rule-based and hence can produce only a limited set of motions and are tuned to a particular scenario. In contrast, data-driven systems are flexible and easily adjustable. Hence we aim to learn a data-driven model for generating non-verbal behavior (in the form of a 3D motion sequence) for humanoid robots. Our approach is based on a popular and powerful deep generative model: the Variational Autoencoder (VAE). Input for our model will be multi-modal and we will iteratively increase its complexity: first, it will only use the speech signal, then also the text transcription, and finally the non-verbal behavior of the conversation partner. We will evaluate our system on virtual avatars as well as on two humanoid robots with different embodiments: NAO and Furhat. Our model can easily be adapted to a novel domain by providing application-specific training data.
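A minimal sketch of the first iteration of this plan (a speech-conditioned VAE) is shown below; the layer sizes, feature dimensions, and PyTorch implementation are assumptions made for illustration, not the proposed system itself.

```python
# Hedged sketch of a speech-conditioned Variational Autoencoder;
# sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechConditionedVAE(nn.Module):
    def __init__(self, pose_dim=96, speech_dim=26, latent_dim=16, hidden=128):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            nn.Linear(pose_dim + speech_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),          # mean and log-variance
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + speech_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, pose, speech):
        mu, log_var = self.encoder(torch.cat([pose, speech], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)       # reparameterisation
        recon = self.decoder(torch.cat([z, speech], -1))
        recon_loss = ((recon - pose) ** 2).sum(-1)                     # reconstruction term
        kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum(-1)  # KL to N(0, I)
        return recon, (recon_loss + kl).mean()

    @torch.no_grad()
    def generate(self, speech):
        """At generation time only speech is given; the latent is drawn from the prior."""
        z = torch.randn(speech.shape[0], self.latent_dim)
        return self.decoder(torch.cat([z, speech], -1))
```

The later iterations described above would extend the conditioning input from speech alone to speech plus text transcription and, finally, the conversation partner's non-verbal behavior.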

Place, publisher, year, edition, pages
Boulder, CO, USA: ACM Digital Library, 2018
Keywords
Non-verbal behavior, data driven systems, machine learning, deep learning, humanoid robot
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-238617 (URN), 10.1145/3242969.3264970 (DOI), 000457913100073 (), 2-s2.0-85056642092 (Scopus ID), 978-1-4503-5692-3 (ISBN)
Conference
2018 International Conference on Multimodal Interaction (ICMI ’18), October 16–20, 2018, Boulder, CO, USA
Projects
EACare
Funder
Swedish Foundation for Strategic Research, 7085
Note

QC 20181106

Available from: 2018-11-05 Created: 2018-11-05 Last updated: 2019-03-18. Bibliographically approved
Jonell, P., Mendelson, J., Storskog, T., Hagman, G., Ostberg, P., Leite, I., . . . Kjellström, H. (2017). Machine Learning and Social Robotics for Detecting Early Signs of Dementia.
2017 (English) Other (Other academic)
National Category
Geriatrics
Identifiers
urn:nbn:se:kth:diva-268358 (URN)
Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-02-18
Identifiers
ORCID iD: orcid.org/0000-0001-9838-8848
