kth.se Publications
Henter, Gustav Eje, Assistant Professor. ORCID iD: orcid.org/0000-0002-1643-1054
Publications (10 of 66)
Kucherenko, T., Wolfert, P., Yoon, Y., Viegas, C., Nikolov, T., Tsakov, M. & Henter, G. E. (2024). Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics, 43(3), Article ID 32.
2024 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 43, no. 3, article id 32. Article in journal (Refereed). Published.
Abstract [en]

This article reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research articles, differences in results here are due only to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around -0.5. Based on the challenge results, we formulate numerous recommendations for system building and evaluation.
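
For illustration, a minimal sketch (not the challenge's actual evaluation code) of the two statistics mentioned above: the FGD between Gaussian fits of natural and synthetic gesture features, and Kendall's tau between per-system FGD and human-likeness ratings. All feature arrays and score lists below are hypothetical placeholders.

```python
import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import kendalltau

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 32))            # placeholder gesture features
fake = rng.normal(loc=0.3, size=(1000, 32))
print("FGD:", frechet_distance(real, fake))

# Rank correlation between per-system FGD values and mean human-likeness
# ratings (both lists made up); lower FGD should track higher ratings,
# hence a negative tau like the roughly -0.5 reported above.
fgd_per_system = [12.1, 8.4, 15.0, 6.2, 10.3]
rating_per_system = [55, 62, 48, 70, 58]
tau, p = kendalltau(fgd_per_system, rating_per_system)
print(f"Kendall's tau = {tau:.2f} (p = {p:.3f})")
```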

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Animation, gesture generation, embodied, conversational agents, evaluation paradigms
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-352263 (URN); 10.1145/3656374 (DOI); 001265558400008 (ISI); 2-s2.0-85192703805 (Scopus ID)
Note

QC 20240827

Available from: 2024-08-27. Created: 2024-08-27. Last updated: 2024-09-06. Bibliographically approved.
Wennberg, U. & Henter, G. E. (2024). Exploring Internal Numeracy in Language Models: A Case Study on ALBERT. In: MathNLP 2024: 2nd Workshop on Mathematical Natural Language Processing at LREC-COLING 2024 - Workshop Proceedings. Paper presented at 2nd Workshop on Mathematical Natural Language Processing, MathNLP 2024, Torino, Italy, May 21 2024 (pp. 35-40). European Language Resources Association (ELRA)
2024 (English). In: MathNLP 2024: 2nd Workshop on Mathematical Natural Language Processing at LREC-COLING 2024 - Workshop Proceedings, European Language Resources Association (ELRA), 2024, p. 35-40. Conference paper, Published paper (Refereed).
Abstract [en]

It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
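
To make the analysis concrete, here is a minimal sketch of the embedding-plus-PCA procedure described above, assuming the Hugging Face transformers (with sentencepiece) and scikit-learn packages; the model size and token list are illustrative choices, not the paper's exact setup.

```python
import torch
from transformers import AlbertTokenizer, AlbertModel
from sklearn.decomposition import PCA

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
# Static input-embedding table that ALBERT uses for every token.
embedding_table = model.get_input_embeddings().weight.detach()

words = ["one", "two", "three", "four", "five",
         "1", "2", "3", "4", "5", "first", "second", "third"]
kept, vectors = [], []
for w in words:
    ids = tokenizer(w, add_special_tokens=False)["input_ids"]
    if len(ids) == 1:  # keep only words mapped to a single SentencePiece token
        kept.append(w)
        vectors.append(embedding_table[ids[0]])

# Project the number-token embeddings onto their two largest axes of variation.
proj = PCA(n_components=2).fit_transform(torch.stack(vectors).numpy())
for w, (x, y) in zip(kept, proj):
    print(f"{w:>7s}: PC1 = {x:+.3f}, PC2 = {y:+.3f}")
```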

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
Language models, Numerals in NLP, Numerical data representation, PCA, Transformer-based models, Word embeddings
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-347702 (URN); 2-s2.0-85195172546 (Scopus ID)
Conference
2nd Workshop on Mathematical Natural Language Processing, MathNLP 2024, Torino, Italy, May 21 2024
Note

QC 20240613

Part of ISBN 978-249381422-7

Available from: 2024-06-13. Created: 2024-06-13. Last updated: 2025-02-07. Bibliographically approved.
Wolfert, P., Henter, G. E. & Belpaeme, T. (2024). Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences, 14(4), Article ID 1460.
2024 (English). In: Applied Sciences, E-ISSN 2076-3417, Vol. 14, no. 4, article id 1460. Article in journal (Refereed). Published.
Abstract [en]

This paper compares three methods for evaluating computer-generated motion behaviour for animated characters: two commonly used direct rating methods and a newly designed questionnaire. The questionnaire is specifically designed to measure the human-likeness, appropriateness, and intelligibility of the generated motion. Furthermore, this study investigates the suitability of these evaluation tools for assessing subtle forms of human behaviour, such as the subdued motion cues shown when listening to someone. This paper reports six user studies: studies that directly rate the appropriateness and human-likeness of a computer character's motion, and studies that instead rely on a questionnaire to measure the quality of the motion. As test data, we used the motion generated by two generative models and recorded human gestures, which served as a gold standard. Our findings indicate that when evaluating gesturing motion, directly rating human-likeness and appropriateness is to be preferred over a questionnaire. However, when assessing the subtle motion of a computer character, even the direct rating method yields less conclusive results. Despite demonstrating high internal consistency, our questionnaire proves to be less sensitive than directly rating the quality of the motion. The results provide insights into the evaluation of human motion behaviour and highlight the complexities involved in capturing subtle nuances in nonverbal communication. These findings have implications for the development and improvement of motion generation models and can guide researchers in selecting appropriate evaluation methodologies for specific aspects of human behaviour.
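
The internal consistency mentioned above is conventionally quantified with Cronbach's alpha (the abstract does not name the statistic, so this is an assumption). A minimal numpy sketch follows, with a made-up response matrix standing in for real questionnaire data.

```python
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    k = responses.shape[1]
    item_vars = responses.var(axis=0, ddof=1)
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 1))                 # shared trait per participant
items = latent + 0.5 * rng.normal(size=(40, 6))   # six correlated questionnaire items
print(f"Cronbach's alpha = {cronbach_alpha(items):.2f}")  # high, since items co-vary
```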

Place, publisher, year, edition, pages
MDPI AG, 2024
Keywords
human-computer interaction, embodied conversational agents, subjective evaluations
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-344465 (URN); 10.3390/app14041460 (DOI); 001170953500001 (ISI); 2-s2.0-85192447790 (Scopus ID)
Note

QC 20240318

Available from: 2024-03-18. Created: 2024-03-18. Last updated: 2024-05-16. Bibliographically approved.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, p. 1952-1964. Conference paper, Published paper (Refereed).
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22. Created: 2024-10-22. Last updated: 2024-10-22. Bibliographically approved.
Yoon, Y., Kucherenko, T., Delbosc, A., Nagy, R., Nikolov, T. & Henter, G. E. (2024). GENEA Workshop 2024: The 5th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents. In: Proceedings of the 26th International Conference on Multimodal Interaction, ICMI 2024. Paper presented at Companion International Conference on Multimodal Interaction, Nov 4-8, 2024, San Jose, Costa Rica (pp. 694-695). Association for Computing Machinery (ACM)
2024 (English). In: Proceedings of the 26th International Conference on Multimodal Interaction, ICMI 2024, Association for Computing Machinery (ACM), 2024, p. 694-695. Conference paper, Published paper (Refereed).
Abstract [en]

Non-verbal behavior offers significant benefits for embodied agents in human interactions. Despite extensive research on the creation of non-verbal behaviors, the field lacks a standardized benchmarking practice: researchers rarely compare their findings with previous studies, and when they do, the comparisons are often not methodologically aligned. The GENEA Workshop 2024 aims to unite the community to discuss major challenges and solutions, and to determine the most effective ways to advance the field.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
behavior synthesis, gesture generation, datasets, evaluation
National Category
Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-362970 (URN); 10.1145/3678957.3688818 (DOI); 001433669800083 (ISI); 2-s2.0-85212592892 (Scopus ID); 979-8-4007-0462-8 (ISBN)
Conference
Companion International Conference on Multimodal Interaction, Nov 4-8, 2024, San Jose, Costa Rica
Note

QC 20250430

Available from: 2025-04-30. Created: 2025-04-30. Last updated: 2025-04-30. Bibliographically approved.
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 11341-11345. Conference paper, Published paper (Refereed).
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
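
As a rough illustration of the training objective named above, here is a minimal PyTorch sketch of an OT-CFM loss in the style of Lipman et al. (2023): interpolate between noise and data along a near-straight path and regress the network onto the constant target velocity. The tiny MLP and tensor shapes are placeholders, not the released Matcha-TTS architecture.

```python
import torch
import torch.nn as nn

sigma_min = 1e-4
# Toy stand-in for the decoder network v_theta(x_t, t).
net = nn.Sequential(nn.Linear(80 + 1, 256), nn.SiLU(), nn.Linear(256, 80))

def ot_cfm_loss(x1: torch.Tensor) -> torch.Tensor:
    """x1: batch of data frames, shape (B, 80), e.g. mel-spectrogram frames."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.shape[0], 1)                 # random time in [0, 1]
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1   # OT interpolation path
    target = x1 - (1 - sigma_min) * x0             # constant target velocity
    pred = net(torch.cat([xt, t], dim=-1))
    return ((pred - target) ** 2).mean()

loss = ot_cfm_loss(torch.randn(16, 80))
loss.backward()
```

Because the target velocity field is (nearly) constant along each path, an ODE solver can traverse it accurately in few steps at synthesis time, which is the source of the speed advantage the abstract describes.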

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN); 10.1109/ICASSP48485.2024.10448291 (DOI); 001396233804117 (ISI); 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16. Created: 2024-07-16. Last updated: 2025-03-26. Bibliographically approved.
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2285-2289). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2285-2289. Conference paper, Published paper (Refereed).
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
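
To illustrate the deterministic-versus-stochastic contrast drawn above, a minimal PyTorch sketch follows: a regressor that returns the same durations on every call, next to durations sampled by Euler integration of a flow-matching vector field from noise. Both toy networks are hypothetical, untrained stand-ins for the paper's models.

```python
import torch
import torch.nn as nn

text_dim, n_tokens = 64, 12
regressor = nn.Linear(text_dim, 1)   # deterministic log-duration per token
vector_field = nn.Sequential(
    nn.Linear(text_dim + 2, 128), nn.SiLU(), nn.Linear(128, 1))

def sample_durations(text_enc: torch.Tensor, n_steps: int = 10) -> torch.Tensor:
    """Euler ODE solve of the learned vector field, from noise at t=0 to t=1."""
    x = torch.randn(text_enc.shape[0], 1)   # one log-duration per token
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((text_enc.shape[0], 1), i * dt)
        x = x + dt * vector_field(torch.cat([text_enc, x, t], dim=-1))
    return x.exp()                          # back to positive durations

enc = torch.randn(n_tokens, text_dim)
print(regressor(enc).exp().squeeze())       # identical timings every call
print(sample_durations(enc).squeeze())      # varies from call to call
```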

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358878 (URN); 10.21437/Interspeech.2024-1582 (DOI); 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-02-25. Bibliographically approved.
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024). Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 8220-8224. Conference paper, Published paper (Refereed).
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in a single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
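
A minimal sketch of the single-process idea above: acoustic and gesture features are stacked into one state vector per frame, and one flow-matching ODE is integrated over the joint state, so speech and motion are sampled together rather than by two chained models. Dimensions and the tiny untrained network are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

mel_dim, pose_dim, n_frames = 80, 45, 200
joint_dim = mel_dim + pose_dim
# Toy vector field over the joint [mel; pose] state.
field = nn.Sequential(
    nn.Linear(joint_dim + 1, 256), nn.SiLU(), nn.Linear(256, joint_dim))

x = torch.randn(n_frames, joint_dim)     # joint noise: [mel; pose] per frame
n_steps, dt = 15, 1.0 / 15
for i in range(n_steps):                 # Euler solve of the joint ODE
    t = torch.full((n_frames, 1), i * dt)
    x = x + dt * field(torch.cat([x, t], dim=-1))

mel, pose = x[:, :mel_dim], x[:, mel_dim:]   # both modalities from one pass
print(mel.shape, pose.shape)
```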

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN); 10.1109/ICASSP48485.2024.10445998 (DOI); 001396233801103 (ISI); 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8,  979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02. Created: 2025-04-02. Last updated: 2025-04-09. Bibliographically approved.
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS. In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings. Paper presented at 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023. Institute of Electrical and Electronics Engineers (IEEE)
2023 (English). In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed).
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
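
A minimal sketch, assuming the Hugging Face transformers package, of extracting one wav2vec2.0 layer as a TTS representation as studied above; the checkpoint name is an illustrative choice rather than the paper's exact model.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

speech = torch.randn(16000).numpy()   # 1 s of placeholder 16 kHz audio
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    out = model(inputs.input_values, output_hidden_states=True)

# hidden_states[0] is the CNN feature projection; index 9 is the output of
# the 9th Transformer layer, used here in place of a mel-spectrogram.
layer9 = out.hidden_states[9]
print(layer9.shape)                   # (1, frames, 768)
```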

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
self-supervised speech representation, speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-335090 (URN); 10.1109/ICASSPW59220.2023.10193157 (DOI); 001046933700056 (ISI); 2-s2.0-85165623363 (Scopus ID)
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023
Note

Part of ISBN 9798350302615

QC 20230831

Available from: 2023-08-31. Created: 2023-08-31. Last updated: 2025-02-07. Bibliographically approved.
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A comparative study of self-supervised speech representations in read and spontaneous TTS. Paper presented at 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece.
2023 (English). Manuscript (preprint) (Other academic).
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

Keywords
speech synthesis, self-supervised speech representation, spontaneous speech
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328741 (URN); 979-8-3503-0261-5 (ISBN)
Conference
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
Projects
Digital Futures project Advanced Adaptive Intelligent Systems (AAIS); Swedish Research Council project Connected (VR-2019-05003); Swedish Research Council project Perception of speaker stance (VR-2020-02396); Riksbankens Jubileumsfond project CAPTivating (P20-0298); Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation
Note

Accepted by the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece

QC 20230620

Available from: 2023-06-12. Created: 2023-06-12. Last updated: 2025-02-18. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-1643-1054