Henter, Gustav Eje, Assistant Professor
Publications (10 of 13)
Alexanderson, S., Henter, G. E., Kucherenko, T. & Beskow, J. (2020). Style-Controllable Speech-Driven Gesture Synthesis Using Normalising Flows. In: : . Paper presented at EUROGRAPHICS 2020.
2020 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Automatic synthesis of realistic gestures promises to transform the fields of animation, avatars and communicative agents. In off-line applications, novel tools can alter the role of an animator to that of a director, who provides only high-level input for the desired animation; a learned network then translates these instructions into an appropriate sequence of body poses. In interactive scenarios, systems for generating natural animations on the fly are key to achieving believable and relatable characters. In this paper we address some of the core issues towards these ends. By adapting a deep learning-based motion synthesis method called MoGlow, we propose a new generative model that produces state-of-the-art, realistic speech-driven gesticulation. Owing to the probabilistic nature of the approach, our model can produce a battery of different, yet plausible, gestures given the same input speech signal. Just as with human gesturing, this yields rich, natural variation in the motion. We additionally demonstrate the ability to exert directorial control over the output style, such as gesture level, speed, symmetry and spatial extent. Such control can be leveraged to convey a desired character personality or mood. We achieve all this without any manual annotation of the data. User studies evaluating upper-body gesticulation confirm that the generated motions are natural and match the input speech well. Our method scores above all prior systems and baselines on these measures, and comes close to the ratings of the original recorded motions. We furthermore find that we can accurately control gesticulation styles without unnecessarily compromising perceived naturalness. Finally, we also demonstrate an application of the same method to full-body gesticulation, including the synthesis of stepping motion and stance.
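To make the probabilistic, style-controllable sampling described in this abstract concrete, the following is a minimal illustrative sketch, not the paper's code: a user-supplied style value is appended to the per-frame conditioning, and repeated sampling with fresh latent noise yields different but plausible gestures for the same speech. The flow object, its sample() interface, the feature shapes and the scalar style value are all assumptions.

```python
# Illustrative only; the model object, its sample() method and all shapes are assumed.
import torch

def sample_styled_gestures(flow, speech_feats, style_value, n_samples=3):
    """speech_feats: (frames, speech_dim); style_value: e.g. desired gesture height or speed."""
    style = torch.full((speech_feats.shape[0], 1), float(style_value))
    conditioning = torch.cat([speech_feats, style], dim=-1)   # speech + style, per frame
    # Each call draws fresh latent noise, so the same speech yields varied, plausible motion.
    return [flow.sample(conditioning) for _ in range(n_samples)]
```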

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-268363 (URN)
Conference
EUROGRAPHICS 2020
Note

QCR 20200513

Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-05-13. Bibliographically approved
Kucherenko, T., Hasegawa, D., Henter, G. E., Kaneko, N. & Kjellström, H. (2019). Analyzing Input and Output Representations for Speech-Driven Gesture Generation. In: 19th ACM International Conference on Intelligent Virtual Agents: . Paper presented at 19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France. New York, NY, USA: ACM Publications
2019 (English) In: 19th ACM International Conference on Intelligent Virtual Agents, New York, NY, USA: ACM Publications, 2019. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.

Our approach consists of two steps. First, we learn a lower-dimensional representation of human motion using a denoising autoencoder neural network, consisting of a motion encoder MotionE and a motion decoder MotionD. The learned representation preserves the most important aspects of the human pose variation while removing less relevant variation. Second, we train a novel encoder network SpeechE to map from speech to a corresponding motion representation with reduced dimensionality. At test time, the speech encoder and the motion decoder networks are combined: SpeechE predicts motion representations based on a given speech signal and MotionD then decodes these representations to produce motion sequences.

We evaluate different representation sizes in order to find the most effective dimensionality for the representation. We also evaluate the effects of using different speech features as input to the model. We find that mel-frequency cepstral coefficients (MFCCs), alone or combined with prosodic features, perform the best. The results of a subsequent user study confirm the benefits of the representation learning.
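As a rough illustration of the two-step approach described in this abstract, the sketch below wires together a motion autoencoder (MotionE/MotionD) and a speech encoder (SpeechE) that share a latent motion representation, and chains SpeechE with MotionD at test time. The network names follow the abstract; all layer sizes, feature dimensions and training details are assumptions, not the authors' implementation.

```python
# Illustrative sketch only: assumed layer sizes and feature dimensions throughout.
import torch
import torch.nn as nn

POSE_DIM, SPEECH_DIM, LATENT_DIM = 45, 26, 32  # assumed dimensionalities

class MotionE(nn.Module):          # motion encoder
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(POSE_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))
    def forward(self, pose):
        return self.net(pose)

class MotionD(nn.Module):          # motion decoder
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, POSE_DIM))
    def forward(self, z):
        return self.net(z)

class SpeechE(nn.Module):          # speech-to-representation encoder
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(SPEECH_DIM, 128), nn.ReLU(),
                                 nn.Linear(128, LATENT_DIM))
    def forward(self, speech):
        return self.net(speech)

# Step 1: train MotionE + MotionD as a denoising autoencoder on (noisy) poses.
# Step 2: train SpeechE to regress MotionE's representation from speech features.
# Test time: chain the speech encoder with the motion decoder.
def synthesise(speech_feats, speech_e, motion_d):
    with torch.no_grad():
        return motion_d(speech_e(speech_feats))   # frames of 3D joint coordinates
```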

Place, publisher, year, edition, pages
New York, NY, USA: ACM Publications, 2019
Keywords
Gesture generation, social robotics, representation learning, neural network, deep learning, gesture synthesis, virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-255035 (URN), 10.1145/3308532.3329472 (DOI), 2-s2.0-85069654899 (Scopus ID), 978-1-4503-6672-4 (ISBN)
Conference
19th ACM International Conference on Intelligent Virtual Agents (IVA '19), July 2-5, 2019, Paris, France
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190902

Available from: 2019-07-16 Created: 2019-07-16 Last updated: 2019-09-02. Bibliographically approved
Székely, É., Henter, G. E. & Gustafson, J. (2019). Casting to Corpus: Segmenting and Selecting Spontaneous Dialogue for TTS with a CNN-LSTM Speaker-Dependent Breath Detector. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): . Paper presented at 44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 12-17, 2019, Brighton, England (pp. 6925-6929). IEEE
2019 (English) In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, p. 6925-6929. Conference paper, Published paper (Refereed)
Abstract [en]

This paper considers utilising breaths to create improved spontaneous-speech corpora for conversational text-to-speech from found audio recordings such as dialogue podcasts. Breaths are of interest since they relate to prosody and speech planning and are independent of language and transcription. Specifically, we propose a semisupervised approach where a fraction of coarsely annotated data is used to train a convolutional and recurrent speaker-specific breath detector operating on spectrograms and zero-crossing rate. The classifier output is used to find target-speaker breath groups (audio segments delineated by breaths) and subsequently select those that constitute clean utterances appropriate for a synthesis corpus. An application to 11 hours of raw podcast audio extracts 1969 utterances (106 minutes), 87% of which are clean and correctly segmented. This outperforms a baseline that performs integrated VAD and speaker attribution without accounting for breaths.
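For orientation only, here is a minimal sketch of a convolutional-plus-recurrent breath detector in the spirit of the abstract, taking per-frame spectrogram features and zero-crossing rate and outputting a per-frame breath probability. The filter counts, feature sizes and layer layout are assumptions, not the trained system described in the paper.

```python
# Illustrative sketch only: a CNN-LSTM breath detector over spectrogram + zero-crossing rate.
import torch
import torch.nn as nn

class BreathDetector(nn.Module):
    def __init__(self, n_mels=80, zcr_dim=1, hidden=64):
        super().__init__()
        # 1-D convolutions over time, treating spectrogram bins (+ ZCR) as channels
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels + zcr_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)      # per-frame breath probability

    def forward(self, spec, zcr):
        # spec: (batch, frames, n_mels); zcr: (batch, frames, 1)
        x = torch.cat([spec, zcr], dim=-1).transpose(1, 2)   # -> (batch, feats, frames)
        x = self.conv(x).transpose(1, 2)                     # -> (batch, frames, 64)
        x, _ = self.lstm(x)
        return torch.sigmoid(self.out(x)).squeeze(-1)        # (batch, frames)

# Frames classified as breaths delimit "breath groups"; clean single-speaker groups
# can then be selected as utterances for the synthesis corpus.
```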

Place, publisher, year, edition, pages
IEEE, 2019
Series
International Conference on Acoustics, Speech and Signal Processing (ICASSP), ISSN 1520-6149
Keywords
Spontaneous speech, found data, speech synthesis corpora, breath detection, computational paralinguistics
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-261049 (URN), 10.1109/ICASSP.2019.8683846 (DOI), 000482554007032 (), 2-s2.0-85069442973 (Scopus ID), 978-1-4799-8131-1 (ISBN)
Conference
44th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 12-17, 2019, Brighton, England
Note

QC 20191002

Available from: 2019-10-02 Created: 2019-10-02 Last updated: 2019-10-02. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). How to train your fillers: uh and um in spontaneous speech synthesis. In: : . Paper presented at The 10th ISCA Speech Synthesis Workshop.
2019 (English) Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261693 (URN)
Conference
The 10th ISCA Speech Synthesis Workshop
Note

QC 20191011

Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2020-04-27. Bibliographically approved
Henter, G. E., Alexanderson, S. & Beskow, J. (2019). MoGlow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598
2019 (English) In: arXiv preprint arXiv:1905.06598. Article in journal (Other academic), Published
Abstract [en]

Data-driven modelling and synthesis of motion is an active research area with applications that include animation, games, and social robotics. This paper introduces a new class of probabilistic, generative, and controllable motion-data models based on normalising flows. Models of this kind can describe highly complex distributions, yet can be trained efficiently using exact maximum likelihood, unlike GANs or VAEs. Our proposed model is autoregressive and uses LSTMs to enable arbitrarily long time-dependencies. Importantly, it is also causal, meaning that each pose in the output sequence is generated without access to poses or control inputs from future time steps; this absence of algorithmic latency is important for interactive applications with real-time motion control. The approach can in principle be applied to any type of motion since it does not make restrictive assumptions such as the motion being cyclic in nature. We evaluate the models on motion-capture datasets of human and quadruped locomotion. Objective and subjective results show that randomly-sampled motion from the proposed method attains a motion quality close to recorded motion capture for both humans and animals.
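To illustrate the core mechanism, the sketch below shows a single affine coupling step of a normalising flow whose scale and shift are conditioned on a context vector (for instance, an LSTM summary of past poses and control inputs), which keeps sampling causal and the log-likelihood exact. Dimensions, layer sizes and the context interface are assumptions; this is not the MoGlow code.

```python
# Minimal sketch of one context-conditioned affine coupling layer; sizes are assumed.
import torch
import torch.nn as nn

POSE_DIM, CTX_DIM = 45, 64

class ConditionalCoupling(nn.Module):
    def __init__(self):
        super().__init__()
        half = POSE_DIM // 2
        self.net = nn.Sequential(
            nn.Linear(half + CTX_DIM, 128), nn.ReLU(),
            nn.Linear(128, 2 * (POSE_DIM - half)))   # outputs log-scale and shift

    def forward(self, x, ctx):
        # x: (batch, POSE_DIM); ctx: (batch, CTX_DIM), e.g. an LSTM state over history/control
        half = POSE_DIM // 2
        x_a, x_b = x[:, :half], x[:, half:]
        log_s, t = self.net(torch.cat([x_a, ctx], dim=-1)).chunk(2, dim=-1)
        z_b = x_b * torch.exp(log_s) + t             # invertible affine transform
        log_det = log_s.sum(dim=-1)                  # exact log-determinant contribution
        return torch.cat([x_a, z_b], dim=-1), log_det

    def inverse(self, z, ctx):
        half = POSE_DIM // 2
        z_a, z_b = z[:, :half], z[:, half:]
        log_s, t = self.net(torch.cat([z_a, ctx], dim=-1)).chunk(2, dim=-1)
        x_b = (z_b - t) * torch.exp(-log_s)
        return torch.cat([z_a, x_b], dim=-1)

# Stacking such layers gives an invertible pose model; exact maximum likelihood uses
# log p(x) = log N(z; 0, I) plus the sum of the per-layer log-determinants.
```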

National Category
Computer Vision and Robotics (Autonomous Systems)
Identifiers
urn:nbn:se:kth:diva-268348 (URN)
Note

QC 20200512

Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-05-12. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Off the cuff: Exploring extemporaneous speech delivery with TTS. In: : . Paper presented at Interspeech.
2019 (English) Conference paper, Published paper (Refereed)
National Category
Engineering and Technology
Identifiers
urn:nbn:se:kth:diva-261691 (URN)
Conference
Interspeech
Note

QC 20191011

Available from: 2019-10-10 Created: 2019-10-10 Last updated: 2019-10-11. Bibliographically approved
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Off the cuff: Exploring extemporaneous speech delivery with TTS. In: : . Paper presented at The 20th Annual Conference of the International Speech Communication Association INTERSPEECH 2019 | Graz, Austria, Sep. 15-19, 2019. (pp. 3687-3688).
2019 (English) Conference paper, Published paper (Refereed)
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-260957 (URN)
Conference
The 20th Annual Conference of the International Speech Communication Association INTERSPEECH 2019 | Graz, Austria, Sep. 15-19, 2019.
Note

QC 20191113

Available from: 2019-09-30 Created: 2019-09-30 Last updated: 2019-11-13. Bibliographically approved
Kucherenko, T., Hasegawa, D., Kaneko, N., Henter, G. E. & Kjellström, H. (2019). On the Importance of Representations for Speech-Driven Gesture Generation: Extended Abstract. In: : . Paper presented at International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada (pp. 2072-2074). The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS)
2019 (English) Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a novel framework for automatic speech-driven gesture generation applicable to human-agent interaction, including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech features as input and produces, as output, gestures in the form of sequences of 3D joint coordinates representing motion. The results of objective and subjective evaluations confirm the benefits of the representation learning.

Place, publisher, year, edition, pages
The International Foundation for Autonomous Agents and Multiagent Systems (IFAAMAS), 2019
Keywords
Gesture generation; social robotics; representation learning; neural network; deep learning; virtual agents
National Category
Human Computer Interaction
Research subject
Human-computer Interaction
Identifiers
urn:nbn:se:kth:diva-251648 (URN), 000474345000309 ()
Conference
International Conference on Autonomous Agents and Multiagent Systems (AAMAS '19), May 13-17, 2019, Montréal, Canada
Projects
EACare
Funder
Swedish Foundation for Strategic Research, RIT15-0107
Note

QC 20190515

Available from: 2019-05-16 Created: 2019-05-16 Last updated: 2019-10-25. Bibliographically approved
Wagner, P., Beskow, J., Betz, S., Edlund, J., Gustafson, J., Henter, G. E., . . . Tånnander, C. (2019). Speech Synthesis Evaluation—State-of-the-Art Assessment and Suggestion for a Novel Research Program. In: Proceedings of the 10th Speech Synthesis Workshop (SSW10): . Paper presented at 10th Speech Synthesis Workshop (SSW10).
2019 (English) In: Proceedings of the 10th Speech Synthesis Workshop (SSW10), 2019. Conference paper, Published paper (Refereed)
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-268347 (URN)
Conference
10th Speech Synthesis Workshop (SSW10)
Available from: 2020-02-18 Created: 2020-02-18 Last updated: 2020-05-06
Székely, É., Henter, G. E., Beskow, J. & Gustafson, J. (2019). Spontaneous conversational speech synthesis from found data. In: : . Paper presented at The 20th Annual Conference of the International Speech Communication Association INTERSPEECH 2019 | Graz, Austria, Sep. 15-19, 2019.
2019 (English) Conference paper, Published paper (Refereed)
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-260958 (URN)
Conference
The 20th Annual Conference of the International Speech Communication Association INTERSPEECH 2019 | Graz, Austria, Sep. 15-19, 2019.
Note

QC 20191113

Available from: 2019-09-30 Created: 2019-09-30 Last updated: 2019-11-13. Bibliographically approved