Henter, Gustav Eje, Assistant Professor (ORCID iD: orcid.org/0000-0002-1643-1054)
Publications (10 of 67)
Tuttösí, P., Mehta, S., Syvenky, Z., Burkanova, B., Hfsafsti, M., Wang, Y., . . . Lim, A. (2025). Take a Look, it's in a Book, a Reading Robot. In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction: . Paper presented at 20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, Mar 4 2025 - Mar 6 2025 (pp. 1803-1805). Institute of Electrical and Electronics Engineers (IEEE)
2025 (English). In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction, Institute of Electrical and Electronics Engineers (IEEE), 2025, pp. 1803-1805. Conference paper, published paper (Refereed)
Abstract [en]

We demonstrate EmojiVoice, a free, customizable text-to-speech (TTS) toolkit for expressive speech on social robots, and showcase our voices through storytelling. The task is intended for deployment in classrooms or libraries, where the robot reads a story aloud to children. Moreover, we introduce adaptive clarity for noisy environments and for listeners with reduced comprehension ability. This storytelling robot voice allows us to demonstrate how, using our lightweight and customizable TTS, we can produce a voice that is expressive, engaging, clear, and socially appropriate for the task, improving interactions with and perceptions of social robots.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
clear speech synthesis, education robots, Expressive speech synthesis, human robot interaction, noise robust speech synthesis, second language speakers, social robotics, storytelling robots
HSV category
Identifiers
urn:nbn:se:kth:diva-363761 (URN), 10.1109/HRI61500.2025.10973801 (DOI), 2-s2.0-105004876693 (Scopus ID)
Conference
20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, Mar 4 2025 - Mar 6 2025
Note

Part of ISBN 9798350378931

QC 20250526

Available from: 2025-05-21. Created: 2025-05-21. Last updated: 2025-05-26. Bibliographically checked.
Kucherenko, T., Wolfert, P., Yoon, Y., Viegas, C., Nikolov, T., Tsakov, M. & Henter, G. E. (2024). Evaluating Gesture Generation in a Large-scale Open Challenge: The GENEA Challenge 2022. ACM Transactions on Graphics, 43(3), Article ID 32.
2024 (English). In: ACM Transactions on Graphics, ISSN 0730-0301, E-ISSN 1557-7368, Vol. 43, no. 3, article id 32. Article in journal (Refereed), Published
Abstract [en]

This article reports on the second GENEA Challenge to benchmark data-driven automatic co-speech gesture generation. Participating teams used the same speech and motion dataset to build gesture-generation systems. Motion generated by all these systems was rendered to video using a standardised visualisation pipeline and evaluated in several large, crowdsourced user studies. Unlike when comparing different research articles, differences in results are here only due to differences between methods, enabling direct comparison between systems. The dataset was based on 18 hours of full-body motion capture, including fingers, of different persons engaging in a dyadic conversation. Ten teams participated in the challenge across two tiers: full-body and upper-body gesticulation. For each tier, we evaluated both the human-likeness of the gesture motion and its appropriateness for the specific speech signal. Our evaluations decouple human-likeness from gesture appropriateness, which has been a difficult problem in the field. The evaluation results show some synthetic gesture conditions being rated as significantly more human-like than 3D human motion capture. To the best of our knowledge, this has not been demonstrated before. On the other hand, all synthetic motion is found to be vastly less appropriate for the speech than the original motion-capture recordings. We also find that conventional objective metrics do not correlate well with subjective human-likeness ratings in this large evaluation. The one exception is the Fréchet gesture distance (FGD), which achieves a Kendall's tau rank correlation of around -0.5. Based on the challenge results, we formulate numerous recommendations for system building and evaluation.
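To make the correlation analysis above concrete, the sketch below computes Kendall's tau between per-system objective scores and subjective human-likeness ratings with scipy.stats.kendalltau. All numbers are invented placeholders, not GENEA Challenge results.

```python
# A minimal sketch of the rank-correlation analysis described in the abstract:
# correlating an objective metric (here, hypothetical FGD values) with subjective
# human-likeness ratings across gesture-generation systems.
from scipy.stats import kendalltau

# Five hypothetical systems (placeholder values, not challenge data):
fgd_scores = [12.4, 25.1, 8.7, 30.2, 18.9]       # lower is better
human_likeness = [62.0, 48.5, 70.3, 41.0, 55.2]  # median subjective ratings

tau, p_value = kendalltau(fgd_scores, human_likeness)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
# A strongly negative tau (around -0.5 in the article) indicates that systems
# with lower FGD tend to receive higher human-likeness ratings.
```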

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Animation, gesture generation, embodied conversational agents, evaluation paradigms
HSV category
Identifiers
urn:nbn:se:kth:diva-352263 (URN), 10.1145/3656374 (DOI), 001265558400008 (), 2-s2.0-85192703805 (Scopus ID)
Note

QC 20240827

Available from: 2024-08-27. Created: 2024-08-27. Last updated: 2024-09-06. Bibliographically checked.
Wennberg, U. & Henter, G. E. (2024). Exploring Internal Numeracy in Language Models: A Case Study on ALBERT. In: MathNLP 2024: 2nd Workshop on Mathematical Natural Language Processing at LREC-COLING 2024 - Workshop Proceedings: . Paper presented at 2nd Workshop on Mathematical Natural Language Processing, MathNLP 2024, Torino, Italy, May 21 2024 (pp. 35-40). European Language Resources Association (ELRA)
2024 (English). In: MathNLP 2024: 2nd Workshop on Mathematical Natural Language Processing at LREC-COLING 2024 - Workshop Proceedings, European Language Resources Association (ELRA), 2024, pp. 35-40. Conference paper, published paper (Refereed)
Abstract [en]

It has been found that Transformer-based language models have the ability to perform basic quantitative reasoning. In this paper, we propose a method for studying how these models internally represent numerical data, and use our proposal to analyze the ALBERT family of language models. Specifically, we extract the learned embeddings these models use to represent tokens that correspond to numbers and ordinals, and subject these embeddings to Principal Component Analysis (PCA). PCA results reveal that ALBERT models of different sizes, trained and initialized separately, consistently learn to use the axes of greatest variation to represent the approximate ordering of various numerical concepts. Numerals and their textual counterparts are represented in separate clusters, but increase along the same direction in 2D space. Our findings illustrate that language models, trained purely to model text, can intuit basic mathematical concepts, opening avenues for NLP applications that intersect with quantitative reasoning.
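The analysis above can be reproduced in outline with a few lines of Python: pull the learned input-embedding vectors for number-related tokens out of a pretrained ALBERT checkpoint and project them with PCA. This is a hedged sketch; the checkpoint name, token list, and single-token filtering are illustrative assumptions rather than the paper's exact setup.

```python
# Extract ALBERT's learned input embeddings for number-related tokens and
# project them to 2D with PCA, as in the study described above (in outline).
import torch
from transformers import AlbertTokenizer, AlbertModel
from sklearn.decomposition import PCA

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
embeddings = model.get_input_embeddings().weight.detach()  # (vocab_size, embed_dim)

words = [str(i) for i in range(1, 11)] + [
    "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "ten",
]

vectors, kept = [], []
for w in words:
    ids = tokenizer(w, add_special_tokens=False)["input_ids"]
    if len(ids) == 1:                      # keep only words that map to a single token
        vectors.append(embeddings[ids[0]])
        kept.append(w)

projected = PCA(n_components=2).fit_transform(torch.stack(vectors).numpy())
for word, (pc1, pc2) in zip(kept, projected):
    print(f"{word:>6s}  PC1={pc1:+.3f}  PC2={pc2:+.3f}")
# If the paper's observation holds, the leading components should increase
# roughly monotonically with numerical magnitude, with numerals and number
# words forming separate clusters.
```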

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
Language models, Numerals in NLP, Numerical data representation, PCA, Transformer-based models, Word embeddings
HSV category
Identifiers
urn:nbn:se:kth:diva-347702 (URN), 2-s2.0-85195172546 (Scopus ID)
Conference
2nd Workshop on Mathematical Natural Language Processing, MathNLP 2024, Torino, Italy, May 21 2024
Note

QC 20240613

Part of ISBN 978-249381422-7

Available from: 2024-06-13. Created: 2024-06-13. Last updated: 2025-02-07. Bibliographically checked.
Wolfert, P., Henter, G. E. & Belpaeme, T. (2024). Exploring the Effectiveness of Evaluation Practices for Computer-Generated Nonverbal Behaviour. Applied Sciences, 14(4), Article ID 1460.
2024 (English). In: Applied Sciences, E-ISSN 2076-3417, Vol. 14, no. 4, article id 1460. Article in journal (Refereed), Published
Abstract [en]

This paper compares three methods for evaluating computer-generated motion behaviour for animated characters: two commonly used direct rating methods and a newly designed questionnaire. The questionnaire is specifically designed to measure the human-likeness, appropriateness, and intelligibility of the generated motion. Furthermore, this study investigates the suitability of these evaluation tools for assessing subtle forms of human behaviour, such as the subdued motion cues shown when listening to someone. This paper reports six user studies, namely studies that directly rate the appropriateness and human-likeness of a computer character's motion, along with studies that instead rely on a questionnaire to measure the quality of the motion. As test data, we used the motion generated by two generative models and recorded human gestures, which served as a gold standard. Our findings indicate that when evaluating gesturing motion, the direct rating of human-likeness and appropriateness is to be preferred over a questionnaire. However, when assessing the subtle motion of a computer character, even the direct rating method yields less conclusive results. Despite demonstrating high internal consistency, our questionnaire proves to be less sensitive than directly rating the quality of the motion. The results provide insights into the evaluation of human motion behaviour and highlight the complexities involved in capturing subtle nuances in nonverbal communication. These findings have implications for the development and improvement of motion generation models and can guide researchers in selecting appropriate evaluation methodologies for specific aspects of human behaviour.
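The abstract refers to the questionnaire's internal consistency; that property is commonly quantified with Cronbach's alpha. The sketch below computes it from a matrix of placeholder ratings, purely to illustrate the measure rather than to reproduce anything from the study.

```python
# A minimal sketch of checking a questionnaire's internal consistency with
# Cronbach's alpha. The ratings matrix is random placeholder data, not study data.
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """ratings: (n_respondents, n_items) matrix of questionnaire scores."""
    n_items = ratings.shape[1]
    item_variances = ratings.var(axis=0, ddof=1).sum()
    total_variance = ratings.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances / total_variance)

rng = np.random.default_rng(0)
ratings = rng.integers(1, 8, size=(30, 6)).astype(float)  # 30 raters, 6 items, 7-point scale
print(f"Cronbach's alpha = {cronbach_alpha(ratings):.2f}")
# Random data like this typically yields a low alpha; a well-designed
# questionnaire aims for values above roughly 0.7.
```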

Place, publisher, year, edition, pages
MDPI AG, 2024
Keywords
human-computer interaction, embodied conversational agents, subjective evaluations
HSV category
Identifiers
urn:nbn:se:kth:diva-344465 (URN), 10.3390/app14041460 (DOI), 001170953500001 (), 2-s2.0-85192447790 (Scopus ID)
Note

QC 20240318

Available from: 2024-03-18. Created: 2024-03-18. Last updated: 2024-05-16. Bibliographically checked.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: . Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, published paper (Refereed)
HSV category
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22. Created: 2024-10-22. Last updated: 2024-10-22. Bibliographically checked.
Yoon, Y., Kucherenko, T., Delbosc, A., Nagy, R., Nikolov, T. & Henter, G. E. (2024). GENEA Workshop 2024: The 5th Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents. In: PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2024: . Paper presented at Companion International Conference on Multimodal Interaction, NOV 04-08, 2024, San Jose, COSTA RICA (pp. 694-695). Association for Computing Machinery (ACM)
2024 (English). In: PROCEEDINGS OF THE 26TH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2024, Association for Computing Machinery (ACM), 2024, pp. 694-695. Conference paper, published paper (Refereed)
Abstract [en]

Non-verbal behavior offers significant benefits for embodied agents in human interactions. Despite extensive research on the creation of non-verbal behaviors, the field lacks a standardized benchmarking practice. Researchers rarely compare their findings with previous studies, and when they do, the comparisons are often not aligned with other methodologies. The GENEA Workshop 2024 aims to unite the community to discuss major challenges and solutions, and to determine the most effective ways to advance the field.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
behavior synthesis, gesture generation, datasets, evaluation
HSV category
Identifiers
urn:nbn:se:kth:diva-362970 (URN), 10.1145/3678957.3688818 (DOI), 001433669800083 (), 2-s2.0-85212592892 (Scopus ID)
Conference
Companion International Conference on Multimodal Interaction, NOV 04-08, 2024, San Jose, COSTA RICA
Note

Part of ISBN 979-8-4007-0462-8

QC 20250430

Available from: 2025-04-30. Created: 2025-04-30. Last updated: 2025-06-03. Bibliographically checked.
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). MATCHA-TTS: A FAST TTS ARCHITECTURE WITH CONDITIONAL FLOW MATCHING. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings: . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 11341-11345. Conference paper, published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
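For readers unfamiliar with OT-CFM, the sketch below shows a generic conditional flow matching training step of the kind the abstract refers to. It is not the Matcha-TTS implementation; the decoder interface, tensor shapes, and sigma_min value are illustrative assumptions.

```python
# A minimal, generic sketch of one optimal-transport conditional flow matching
# (OT-CFM) training step. `decoder` stands in for any network predicting a
# vector field v(x_t, t | cond); it is not the actual Matcha-TTS code.
import torch
import torch.nn.functional as F

def ot_cfm_loss(decoder, x1, cond, sigma_min: float = 1e-4):
    """x1: target acoustic features (batch, ...); cond: conditioning, e.g. text encoding."""
    batch = x1.shape[0]
    t = torch.rand(batch, *([1] * (x1.dim() - 1)), device=x1.device)  # t ~ U(0, 1)
    x0 = torch.randn_like(x1)                                         # noise sample
    # Straight-line (OT) probability path between noise and data:
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target_velocity = x1 - (1.0 - sigma_min) * x0
    predicted_velocity = decoder(x_t, t.flatten(), cond)
    return F.mse_loss(predicted_velocity, target_velocity)
```

At synthesis time, the learned vector field is integrated with an ODE solver from Gaussian noise towards the data distribution, which is what permits high output quality in only a few synthesis steps.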

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
HSV category
Identifiers
urn:nbn:se:kth:diva-350551 (URN), 10.1109/ICASSP48485.2024.10448291 (DOI), 001396233804117 (), 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16. Created: 2024-07-16. Last updated: 2025-03-26. Bibliographically checked.
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2285-2289). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2285-2289. Conference paper, published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
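To illustrate the contrast drawn above, the sketch below samples durations from a hypothetical flow-matching duration model by simple Euler ODE integration, whereas a deterministic regressor would return the same point estimate every time. The duration_field network and tensor shapes are assumptions for illustration, not the authors' code.

```python
# Sampling durations stochastically from a flow-matching duration model,
# versus the deterministic regression described above.
import torch

@torch.no_grad()
def sample_durations(duration_field, text_cond, n_steps: int = 10):
    """Draw one stochastic (log-)duration sequence by Euler ODE integration."""
    x = torch.randn(text_cond.shape[0], text_cond.shape[1], 1)  # start from noise
    dt = 1.0 / n_steps
    for step in range(n_steps):
        t = torch.full((x.shape[0],), step * dt)
        x = x + dt * duration_field(x, t, text_cond)             # Euler step
    return x.squeeze(-1).exp()                                    # back to durations

# A deterministic regression model would instead output a single point estimate,
# e.g. durations = duration_regressor(text_cond).exp(), giving identical timings
# on every synthesis of the same utterance.
```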

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
HSV category
Identifiers
urn:nbn:se:kth:diva-358878 (URN), 10.21437/Interspeech.2024-1582 (DOI), 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-02-25. Bibliographically checked.
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024): . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), APR 14-19, 2024, Seoul, SOUTH KOREA (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 8220-8224. Conference paper, published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
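A minimal way to picture the "one single process" described above is to concatenate per-frame acoustic features and gesture features into one target vector and train a single flow-matching decoder on it. The sketch below only illustrates that data layout; the feature dimensions are invented and the actual architecture in the paper differs in its details.

```python
# Building a joint speech-and-gesture target for a single flow-matching decoder.
# Feature dimensions are illustrative, not those used in the paper.
import torch

mel = torch.randn(1, 200, 80)       # (batch, frames, acoustic feature dim)
motion = torch.randn(1, 200, 45)    # (batch, frames, skeletal pose dim)
joint_target = torch.cat([mel, motion], dim=-1)  # (1, 200, 125)

# Training then applies the same OT-CFM objective sketched for Matcha-TTS above,
# with `joint_target` as x1; at synthesis time the generated vector is split back
# into its acoustic and motion components.
```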

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
HSV category
Identifiers
urn:nbn:se:kth:diva-361616 (URN), 10.1109/ICASSP48485.2024.10445998 (DOI), 001396233801103 (), 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), APR 14-19, 2024, Seoul, SOUTH KOREA
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02. Created: 2025-04-02. Last updated: 2025-04-09. Bibliographically checked.
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS. In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings. Paper presented at 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023. Institute of Electrical and Electronics Engineers (IEEE)
2023 (English). In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, published paper (Refereed)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims to address these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrograms, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
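The layer-selection idea above can be sketched with the Hugging Face transformers API: run a 12-layer, ASR-fine-tuned wav2vec 2.0 model with output_hidden_states=True and take the ninth transformer layer as the TTS representation. The checkpoint name below is an assumption for illustration; the paper's exact model may differ.

```python
# Extracting a specific wav2vec 2.0 layer as an intermediate TTS representation.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

waveform = torch.randn(16000)        # 1 s of placeholder audio at 16 kHz
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the CNN feature-encoder output; hidden_states[9] is the
# output of the 9th transformer layer.
layer9 = outputs.hidden_states[9]    # (batch, frames, hidden_dim)
print(layer9.shape)
```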

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
self-supervised speech representation, speech synthesis, spontaneous speech
HSV category
Identifiers
urn:nbn:se:kth:diva-335090 (URN), 10.1109/ICASSPW59220.2023.10193157 (DOI), 001046933700056 (), 2-s2.0-85165623363 (Scopus ID)
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023
Note

Part of ISBN 9798350302615

QC 20230831

Available from: 2023-08-31. Created: 2023-08-31. Last updated: 2025-02-07. Bibliographically checked.