KTH Publications
Henter, Gustav Eje, Assistant Professor (ORCID iD: orcid.org/0000-0002-1643-1054)
Publications (10 of 58)
Wolfert, P., Henter, G. E. & Belpaeme, T. (2024). Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences, 14(4), Article ID 1460.
Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour
2024 (English). In: Applied Sciences, E-ISSN 2076-3417, Vol. 14, no 4, article id 1460. Article in journal (Refereed). Published.
Abstract [en]

This paper compares three methods for evaluating computer-generated motion behaviour for animated characters: two commonly used direct rating methods and a newly designed questionnaire. The questionnaire is specifically designed to measure the human-likeness, appropriateness, and intelligibility of the generated motion. Furthermore, this study investigates the suitability of these evaluation tools for assessing subtle forms of human behaviour, such as the subdued motion cues shown when listening to someone. This paper reports six user studies, namely studies that directly rate the appropriateness and human-likeness of a computer character's motion, along with studies that instead rely on a questionnaire to measure the quality of the motion. As test data, we used the motion generated by two generative models and recorded human gestures, which served as a gold standard. Our findings indicate that when evaluating gesturing motion, the direct rating of human-likeness and appropriateness is to be preferred over a questionnaire. However, when assessing the subtle motion of a computer character, even the direct rating method yields less conclusive results. Despite demonstrating high internal consistency, our questionnaire proves to be less sensitive than directly rating the quality of the motion. The results provide insights into the evaluation of human motion behaviour and highlight the complexities involved in capturing subtle nuances in nonverbal communication. These findings have implications for the development and improvement of motion generation models and can guide researchers in selecting appropriate evaluation methodologies for specific aspects of human behaviour.
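The high internal consistency mentioned above is commonly quantified with Cronbach's alpha. The sketch below shows that standard computation on made-up rating data; the function and numbers are illustrative and not taken from the paper.

```python
# Illustrative sketch (not from the paper): Cronbach's alpha, a standard
# measure of a questionnaire's internal consistency, computed with NumPy.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: array of shape (n_respondents, n_items)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                         # number of questionnaire items
    item_vars = ratings.var(axis=0, ddof=1)      # per-item sample variance
    total_var = ratings.sum(axis=1).var(ddof=1)  # variance of summed scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

# Hypothetical example: 5 respondents rating 4 items on a 1-7 scale.
scores = np.array([[5, 6, 5, 6],
                   [3, 3, 4, 3],
                   [6, 7, 6, 6],
                   [2, 2, 3, 2],
                   [4, 5, 4, 5]])
print(f"alpha = {cronbach_alpha(scores):.2f}")
```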

Place, publisher, year, edition, pages
MDPI AG, 2024
Keywords
human-computer interaction, embodied conversational agents, subjective evaluations
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-344465 (URN); 10.3390/app14041460 (DOI); 001170953500001; 2-s2.0-85192447790 (Scopus ID)
Note

QC 20240318

Available from: 2024-03-18. Created: 2024-03-18. Last updated: 2024-05-16. Bibliographically approved.
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS. In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings. Paper presented at 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023. Institute of Electrical and Electronics Engineers (IEEE)
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
2023 (English). In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed).
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
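For context, the sketch below shows one way to extract a single hidden layer from a wav2vec 2.0 model with Hugging Face Transformers, i.e. the kind of representation the study compares against mel-spectrograms. The checkpoint name, layer index, and dummy audio are assumptions for illustration; the paper's exact models and pipeline may differ.

```python
# Minimal sketch: extract one hidden layer from a wav2vec 2.0 checkpoint.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-base-960h"   # assumption: any 12-layer ASR-finetuned checkpoint
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

waveform = torch.randn(16000)                # 1 s of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] holds the pre-transformer features; index 9 is the output
# of the 9th transformer layer, the layer the abstract reports as best.
layer9 = outputs.hidden_states[9]            # shape: (batch, frames, 768)
print(layer9.shape)
```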

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
self-supervised speech representation, speech synthesis, spontaneous speech
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-335090 (URN); 10.1109/ICASSPW59220.2023.10193157 (DOI); 001046933700056; 2-s2.0-85165623363 (Scopus ID)
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023
Note

Part of ISBN 9798350302615

QC 20230831

Available from: 2023-08-31. Created: 2023-08-31. Last updated: 2023-09-26. Bibliographically approved.
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A comparative study of self-supervised speech representations in read and spontaneous TTS. Paper presented at 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece.
A comparative study of self-supervised speech representations in read and spontaneous TTS
2023 (English). Manuscript (preprint) (Other academic).
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

Keywords
speech synthesis, self-supervised speech representation, spontaneous speech
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Interaction Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328741 (URN); 979-8-3503-0261-5 (ISBN)
Conference
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
Projects
Digital Futures project Advanced Adaptive Intelligent Systems (AAIS); Swedish Research Council project Connected (VR-2019-05003); Swedish Research Council project Perception of speaker stance (VR-2020-02396); Riksbankens Jubileumsfond project CAPTivating (P20-0298); Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation
Note

Accepted by the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece

QC 20230620

Available from: 2023-06-12. Created: 2023-06-12. Last updated: 2023-06-20. Bibliographically approved.
Nyatsanga, S., Kucherenko, T., Ahuja, C., Henter, G. E. & Neff, M. (2023). A Comprehensive Review of Data-Driven Co-Speech Gesture Generation. Computer graphics forum (Print), 42(2), 569-596
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation
2023 (English). In: Computer graphics forum (Print), ISSN 0167-7055, E-ISSN 1467-8659, Vol. 42, no 2, p. 569-596. Article in journal (Refereed). Published.
Abstract [en]

Gestures that accompany speech are an essential part of natural and efficient embodied human communication. The automatic generation of such co-speech gestures is a long-standing problem in computer animation and is considered an enabling technology for creating believable characters in film, games, and virtual social spaces, as well as for interaction with social robots. The problem is made challenging by the idiosyncratic and non-periodic nature of human co-speech gesture motion, and by the great diversity of communicative functions that gestures encompass. The field of gesture generation has seen surging interest in the last few years, owing to the emergence of more and larger datasets of human gesture motion, combined with strides in deep-learning-based generative models that benefit from the growing availability of data. This review article summarizes co-speech gesture generation research, with a particular focus on deep generative models. First, we articulate the theory describing human gesticulation and how it complements speech. Next, we briefly discuss rule-based and classical statistical gesture synthesis, before delving into deep learning approaches. We employ the choice of input modalities as an organizing principle, examining systems that generate gestures from audio, text and non-linguistic input. Concurrent with the exposition of deep learning approaches, we chronicle the evolution of the related training data sets in terms of size, diversity, motion quality, and collection method (e.g., optical motion capture or pose estimation from video). Finally, we identify key research challenges in gesture generation, including data availability and quality; producing human-like motion; grounding the gesture in the co-occurring speech in interaction with other speakers, and in the environment; performing gesture evaluation; and integration of gesture synthesis into applications. We highlight recent approaches to tackling the various key challenges, as well as the limitations of these approaches, and point toward areas of future development.

Place, publisher, year, edition, pages
Wiley, 2023
Keywords
co-speech gestures, deep learning, gesture generation, social robotics, virtual agents; CCS Concepts: • Computing methodologies → Animation; Machine learning; • Human-centered computing → Human computer interaction (HCI)
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-331548 (URN); 10.1111/cgf.14776 (DOI); 001000062600041; 2-s2.0-85159859544 (Scopus ID)
Note

QC 20230711

Available from: 2023-07-11. Created: 2023-07-11. Last updated: 2023-07-21. Bibliographically approved.
Pérez Zarazaga, P., Henter, G. E. & Malisz, Z. (2023). A processing framework to access large quantities of whispered speech found in ASMR. In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023. Rhodes, Greece: IEEE Signal Processing Society
A processing framework to access large quantities of whispered speech found in ASMR
2023 (English). In: ICASSP 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes, Greece: IEEE Signal Processing Society, 2023. Conference paper, Published paper (Refereed).
Abstract [en]

Whispering is a ubiquitous mode of communication that humans use daily. Despite this, whispered speech has been poorly served by existing speech technology due to a shortage of resources and processing methodology. To remedy this, this paper provides a processing framework that enables access to large and unique data of high-quality whispered speech. We obtain the data from recordings submitted to online platforms as part of the ASMR media-cultural phenomenon. We describe our processing pipeline and a method for improved whispered activity detection (WAD) in the ASMR data. To efficiently obtain labelled, clean whispered speech, we complement the automatic WAD by using Edyson, a bulk audio annotation tool with human-in-the-loop. We also tackle a problem particular to ASMR: separation of whisper from other acoustic triggers present in the genre. We show that the proposed WAD and the efficient labelling allow us to build extensively augmented data and train a classifier that extracts clean whisper segments from ASMR audio. Our large and growing dataset enables whisper-capable, data-driven speech technology and linguistic analysis. It also opens opportunities in e.g. HCI as a resource that may elicit emotional, psychological and neuro-physiological responses in the listener.
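As a rough illustration of what whispered activity detection involves, the sketch below flags frames that are audible yet noise-excited using simple librosa features. This is a naive baseline, not the paper's improved WAD method; the input file name and thresholds are placeholders.

```python
# Naive whisper-activity baseline: present energy + flat (noise-like) spectrum.
import numpy as np
import librosa

audio, sr = librosa.load("asmr_clip.wav", sr=16000)   # hypothetical input file

frame_length, hop_length = 1024, 256
rms = librosa.feature.rms(y=audio, frame_length=frame_length,
                          hop_length=hop_length)[0]
flatness = librosa.feature.spectral_flatness(y=audio, n_fft=frame_length,
                                             hop_length=hop_length)[0]

# Whispered speech is noise-excited: audible (RMS above a floor) but with a
# flat spectrum (high spectral flatness). Thresholds are arbitrary here.
is_whisper = (rms > 0.01) & (flatness > 0.3)

print(f"{is_whisper.mean():.1%} of {len(is_whisper)} frames flagged as whisper-like")
```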

Place, publisher, year, edition, pages
Rhodes, Greece: IEEE Signal Processing Society, 2023
Keywords
Whispered speech, WAD, human-in-the-loop, autonomous sensory meridian response
National Category
Signal Processing
Research subject
Information and Communication Technology; Human-computer Interaction; Computer Science; Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328771 (URN); 10.1109/ICASSP49357.2023.10095965 (DOI); 2-s2.0-85177548955 (Scopus ID)
Conference
ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4-10 June 2023
Projects
Multimodal encoding of prosodic prominence in voiced and whispered speech
Funder
Swedish Research Council, 2017-02861; Wallenberg AI, Autonomous Systems and Software Program (WASP)
Note

QC 20230630

Available from: 2023-06-29. Created: 2023-06-29. Last updated: 2023-11-29. Bibliographically approved.
Wolfert, P., Henter, G. E. & Belpaeme, T. (2023). "Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion. In: ICMI 2023 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction. Paper presented at 25th International Conference on Multimodal Interaction, ICMI 2023 Companion, Paris, France, Oct 9 2023 - Oct 13 2023 (pp. 6-10). Association for Computing Machinery (ACM)
"Am I listening?", Evaluating the Quality of Generated Data-driven Listening Motion
2023 (English). In: ICMI 2023 Companion: Companion Publication of the 25th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2023, p. 6-10. Conference paper, Published paper (Refereed).
Abstract [en]

This paper asks whether recent models for generating co-speech gesticulation may also learn to exhibit listening behaviour. We consider two models from recent gesture-generation challenges and train them on a dataset of audio and 3D motion capture from dyadic conversations. One model is driven by information from both sides of the conversation, whereas the other only uses the character's own speech. Several user studies are performed to assess the motion generated when the character is speaking actively, versus when the character is the listener in the conversation. We find that participants are reliably able to discern motion associated with listening, whether from motion capture or generated by the models. Both models are thus able to produce distinctive listening behaviour, even though only one model is truly a listener, in the sense that it has access to information from the other party in the conversation. Additional experiments on both natural and model-generated motion find motion associated with listening to be rated as less human-like than motion associated with active speaking.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
embodied conversational agents, listening behaviour
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-339688 (URN); 10.1145/3610661.3617160 (DOI); 2-s2.0-85175853253 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction, ICMI 2023 Companion, Paris, France, Oct 9 2023 - Oct 13 2023
Note

Part of ISBN 9798400703218

QC 20231116

Available from: 2023-11-16. Created: 2023-11-16. Last updated: 2023-11-16. Bibliographically approved.
Yoon, Y., Kucherenko, T., Woo, J., Wolfert, P., Nagy, R. & Henter, G. E. (2023). GENEA Workshop 2023: The 4th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents. In: ICMI 2023: Proceedings of the 25th International Conference on Multimodal Interaction. Paper presented at 25th International Conference on Multimodal Interaction, ICMI 2023, Paris, France, Oct 9 2023 - Oct 13 2023 (pp. 822-823). Association for Computing Machinery (ACM)
GENEA Workshop 2023: The 4th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents
2023 (English). In: ICMI 2023: Proceedings of the 25th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2023, p. 822-823. Conference paper, Published paper (Refereed).
Abstract [en]

Non-verbal behavior is advantageous for embodied agents when interacting with humans. Despite many years of research on the generation of non-verbal behavior, there is no established benchmarking practice in the field. Most researchers do not compare their results to prior work, and if they do, they often do so in a manner that is not compatible with other approaches. The GENEA Workshop 2023 seeks to bring the community together to discuss the major challenges and solutions, and to identify the best ways to progress the field.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
behavior synthesis, datasets, evaluation, gesture generation
National Category
Communication Studies
Identifiers
urn:nbn:se:kth:diva-339690 (URN); 10.1145/3577190.3616856 (DOI); 001147764700105; 2-s2.0-85175832532 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction, ICMI 2023, Paris, France, Oct 9 2023 - Oct 13 2023
Note

Part of ISBN 9798400700552

QC 20231116

Available from: 2023-11-16. Created: 2023-11-16. Last updated: 2024-02-21. Bibliographically approved.
Alexanderson, S., Nagy, R., Beskow, J. & Henter, G. E. (2023). Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models. ACM Transactions on Graphics, 42(4), Article ID 44.
Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models
2023 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 42, no 4, article id 44. Article in journal (Refereed). Published.
Abstract [en]

Diffusion models have experienced a surge of interest as highly expressive yet efficiently trainable probabilistic models. We show that these models are an excellent fit for synthesising human motion that co-occurs with audio, e.g., dancing and co-speech gesticulation, since motion is complex and highly ambiguous given audio, calling for a probabilistic description. Specifically, we adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power. We also demonstrate control over motion style, using classifier-free guidance to adjust the strength of the stylistic expression. Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. We also synthesise path-driven locomotion using the same model architecture. Finally, we generalise the guidance procedure to obtain product-of-expert ensembles of diffusion models and demonstrate how these may be used for, e.g., style interpolation, a contribution we believe is of independent interest.
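The style control described above uses classifier-free guidance. A minimal sketch of the guided noise estimate is given below; the denoiser interface, argument names, and guidance scale are illustrative assumptions rather than the paper's actual code.

```python
# Classifier-free guidance for a conditional diffusion denoiser (sketch).
import torch

def guided_noise_estimate(model, x_t, t, audio_cond, style_cond, guidance_scale=2.0):
    """Blend style-conditional and style-unconditional noise predictions.

    guidance_scale = 1.0 reproduces ordinary conditional sampling;
    larger values exaggerate the stylistic expression, smaller values mute it.
    """
    eps_cond = model(x_t, t, audio=audio_cond, style=style_cond)    # hypothetical interface
    eps_uncond = model(x_t, t, audio=audio_cond, style=None)        # style conditioning dropped
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```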

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
conformers, dance, diffusion models, ensemble models, generative models, gestures, guided interpolation, locomotion, machine learning, product of experts
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-335345 (URN); 10.1145/3592458 (DOI); 001044671300010; 2-s2.0-85166332883 (Scopus ID)
Note

QC 20230907

Available from: 2023-09-07. Created: 2023-09-07. Last updated: 2023-09-22. Bibliographically approved.
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 4279-4283. Conference paper, Published paper (Refereed).
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
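The exact maximum-likelihood training mentioned above follows from the change-of-variables formula for normalising flows: log p(x) = log p_z(f(x)) + log |det df/dx|. The toy sketch below computes that log-likelihood for a single affine transform; it is a minimal stand-in under stated assumptions, not the paper's invertible post-net.

```python
# Toy normalising-flow log-likelihood via change of variables.
import torch

class AffineFlow(torch.nn.Module):
    """Invertible elementwise transform z = (x - shift) * exp(-log_scale)."""
    def __init__(self, dim):
        super().__init__()
        self.shift = torch.nn.Parameter(torch.zeros(dim))
        self.log_scale = torch.nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        z = (x - self.shift) * torch.exp(-self.log_scale)
        log_det = -self.log_scale.sum()      # log |det dz/dx| for this transform
        return z, log_det

def flow_log_likelihood(flow, x):
    # log p(x) = log N(z; 0, I) + log |det dz/dx|
    z, log_det = flow(x)
    base = torch.distributions.Normal(0.0, 1.0)
    return base.log_prob(z).sum(dim=-1) + log_det

x = torch.randn(4, 80)                       # e.g. four 80-dimensional acoustic frames
flow = AffineFlow(80)
print(flow_log_likelihood(flow, x))          # one exact log-likelihood per frame
```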

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-338584 (URN); 10.21437/Interspeech.2023-1996 (DOI); 2-s2.0-85167953412 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231107

Available from: 2023-11-07. Created: 2023-11-07. Last updated: 2023-11-07. Bibliographically approved.
Lameris, H., Mehta, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). Prosody-Controllable Spontaneous TTS with Neural HMMs. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Paper presented at International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE)
Prosody-Controllable Spontaneous TTS with Neural HMMs
2023 (English). In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed).
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice.
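As a generic illustration of utterance-level prosody control by conditioning (a sketch under stated assumptions, not the system's actual architecture), the code below projects a small control vector and adds it to every encoder frame before decoding; all module names, control choices, and dimensions are assumptions.

```python
# Sketch: condition a TTS encoder on an utterance-level prosody control vector.
import torch

class ProsodyConditioner(torch.nn.Module):
    def __init__(self, encoder_dim=256, n_controls=2):
        super().__init__()
        self.proj = torch.nn.Linear(n_controls, encoder_dim)

    def forward(self, encoder_out, prosody_controls):
        # encoder_out: (batch, phones, encoder_dim)
        # prosody_controls: (batch, n_controls), e.g. normalised pitch and rate
        bias = self.proj(prosody_controls).unsqueeze(1)   # (batch, 1, encoder_dim)
        return encoder_out + bias                         # broadcast over phones

enc = torch.randn(2, 40, 256)                     # dummy encoder outputs
controls = torch.tensor([[0.8, -0.2], [-0.5, 0.4]])
conditioned = ProsodyConditioner()(enc, controls)
print(conditioned.shape)                          # torch.Size([2, 40, 256])
```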

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
Speech Synthesis, Prosodic Control, NeuralHMM, Spontaneous speech, Creaky voice
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-327893 (URN); 10.1109/ICASSP49357.2023.10097200 (DOI)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Funder
Swedish Research Council, VR-2019-05003; Swedish Research Council, VR-2020-02396; Riksbankens Jubileumsfond, P20-0298; Knut and Alice Wallenberg Foundation, WASP
Note

QC 20230602

Available from: 2023-06-01. Created: 2023-06-01. Last updated: 2023-06-02. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-1643-1054