kth.se Publications
Publications (10 of 177)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings: . Paper presented at 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024 (pp. 290-297). Springer Nature
A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI
2025 (English) In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, p. 290-297. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.
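
To make the described information flow concrete, the following minimal Python sketch shows one way operator annotations (recognised emotions, detected actions, topic shifts) could be merged into the dialogue context passed to a chat-based LLM such as GPT-4o. This is an illustration only, not the authors' implementation; all class, field, and function names are hypothetical.

    from dataclasses import dataclass, field

    @dataclass
    class OperatorEvent:
        kind: str      # e.g. "emotion", "action", "topic_shift"
        content: str   # e.g. "user looks confused"

    @dataclass
    class InteractionManager:
        history: list = field(default_factory=list)   # prior chat messages
        pending: list = field(default_factory=list)   # operator events not yet consumed

        def add_operator_event(self, event: OperatorEvent) -> None:
            self.pending.append(event)

        def build_context(self, system_prompt: str, user_utterance: str) -> list:
            """Assemble the message list for the next LLM call, injecting
            operator input as extra system messages before the user turn."""
            messages = [{"role": "system", "content": system_prompt}] + self.history
            for ev in self.pending:
                messages.append({"role": "system",
                                 "content": f"[operator:{ev.kind}] {ev.content}"})
            self.pending.clear()
            messages.append({"role": "user", "content": user_utterance})
            return messages

    manager = InteractionManager()
    manager.add_operator_event(OperatorEvent("emotion", "user appears frustrated"))
    context = manager.build_context("You are a helpful social robot.",
                                    "Can you repeat that more slowly?")
    # `context` would then be sent to a chat-completion endpoint (e.g. GPT-4o).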

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
National Category
Natural Language Processing; Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-362497 (URN); 10.1007/978-981-96-3519-1_27 (DOI); 2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16 Created: 2025-04-16 Last updated: 2025-04-24. Bibliographically approved
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
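
As a rough illustration of the input representation described above, the sketch below maps phoneme symbols to positions on continuous feature axes in [0.0, 1.0] and shifts one phoneme part-way along an axis, in the spirit of the categorical perception experiment. The feature names, dimensionality, and values are invented for the example; the paper itself uses 11 features for US English.

    import numpy as np

    # Hypothetical feature axes and phoneme positions (the paper uses 11 axes).
    FEATURES = ["voicing", "nasality", "frication"]
    PHONEME_TABLE = {
        "s": np.array([0.0, 0.0, 1.0]),   # voiceless fricative
        "z": np.array([1.0, 0.0, 1.0]),   # voiced fricative
        "n": np.array([1.0, 1.0, 0.0]),   # nasal
    }

    def phonemes_to_features(phonemes):
        """Replace each symbol with its position on the feature axes; the
        resulting matrix is what conditions the acoustic model."""
        return np.stack([PHONEME_TABLE[p] for p in phonemes])

    def shift_feature(vec, feature, value):
        """Move a phoneme part-way along one axis, e.g. gradually devoicing
        /z/ towards /s/ for a categorical-perception stimulus."""
        out = vec.copy()
        out[FEATURES.index(feature)] = value
        return out

    feature_matrix = phonemes_to_features(["s", "n", "z"])
    half_voiced_z = shift_feature(PHONEME_TABLE["z"], "voicing", 0.5)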

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN); 10.21437/Interspeech.2024-1565 (DOI); 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-28. Bibliographically approved
Malmberg, F., Klezovich, A., Mesch, J. & Beskow, J. (2024). Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data. In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024: . Paper presented at 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024 (pp. 219-224). Association for Computational Linguistics (ACL)
Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data
2024 (English) In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Association for Computational Linguistics (ACL), 2024, p. 219-224. Conference paper, Published paper (Refereed)
Abstract [en]

Unsupervised representation learning offers a promising way of utilising large unannotated sign language resources found on the Internet. In this paper, a representation learning model, VQ-VAE, is trained to learn a codebook of motion primitives from sign language data. For training, we use isolated signs and sentences from a sign language dictionary. Three models are trained: one on isolated signs, one on sentences, and one mixed model. We test these models by comparing how well they are able to reconstruct held-out data from the dictionary, as well as an in-the-wild dataset based on sign language videos from YouTube. These data are characterized by less formal and more expressive signing than the dictionary items. Results show that the isolated sign model yields considerably higher reconstruction loss for the YouTube dataset, while the sentence model performs the best on this data. Further, an analysis of codebook usage reveals that the sets of codes used by isolated signs and sentences differ significantly. In order to further understand the different characteristics of the datasets, we carry out an analysis of the velocity profiles, which reveals that signing data in-the-wild has a much higher average velocity than dictionary signs and sentences. We believe these differences also explain the large differences in reconstruction loss observed.
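
For readers unfamiliar with the model class, the following sketch shows the vector-quantisation step at the core of a VQ-VAE in PyTorch: each encoded pose frame is snapped to its nearest codebook entry, with a straight-through estimator for gradients. The codebook size, feature dimension, and commitment weight are illustrative values, not those used in the paper.

    import torch
    import torch.nn as nn

    class VectorQuantizer(nn.Module):
        def __init__(self, num_codes=512, dim=64, beta=0.25):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)
            self.beta = beta

        def forward(self, z):                                  # z: (batch, frames, dim)
            flat = z.reshape(-1, z.shape[-1])
            dists = torch.cdist(flat, self.codebook.weight)    # distance to every code
            idx = dists.argmin(dim=1)                          # nearest code per frame
            z_q = self.codebook(idx).view_as(z)
            # codebook loss + commitment loss
            loss = ((z_q - z.detach()) ** 2).mean() \
                 + self.beta * ((z_q.detach() - z) ** 2).mean()
            z_q = z + (z_q - z).detach()                       # straight-through gradients
            return z_q, idx.view(z.shape[:-1]), loss

    vq = VectorQuantizer()
    encoded_poses = torch.randn(2, 100, 64)    # e.g. encoder output for pose sequences
    quantised, code_indices, vq_loss = vq(encoded_poses)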

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
Keywords
Pose Codebook, Representation Learning, sign language data, VQ-VAE
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-350726 (URN); 2-s2.0-85197480349 (Scopus ID)
Conference
11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024
Projects
signbot
Note

Part of ISBN 9782493814302

QC 20240719

Available from: 2024-07-17 Created: 2024-07-17 Last updated: 2024-10-23. Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: . Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English) In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, p. 1952-1964. Conference paper, Published paper (Refereed)
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22. Bibliographically approved
Werner, A. W., Beskow, J. & Deichler, A. (2024). Gesture Evaluation in Virtual Reality. In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction: . Paper presented at 26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024 (pp. 156-164). Association for Computing Machinery (ACM)
Gesture Evaluation in Virtual Reality
2024 (English) In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2024, p. 156-164. Conference paper, Published paper (Refereed)
Abstract [en]

Gestures play a crucial role in human communication, enhancing interpersonal interactions through non-verbal expression. Burgeoning technology allows virtual avatars to leverage AI-generated communicative gestures to enhance their life-likeness and communication quality. Traditionally, evaluations of AI-generated gestures have been confined to 2D settings. However, Virtual Reality (VR) offers an immersive alternative with the potential to affect the perception of virtual gestures. This paper introduces a novel evaluation approach for computer-generated gestures, investigating the impact of a fully immersive environment compared to a traditional 2D setting. The goal is to find the differences, benefits, and drawbacks of the two alternatives. Furthermore, the study also aims to investigate three gesture generation algorithms submitted to the 2023 GENEA Challenge and evaluate their performance in the two virtual settings. Experiments showed that the VR setting has an impact on the rating of generated gestures. Participants tended to rate gestures observed in VR slightly higher on average than in 2D. Furthermore, the results of the study showed that the generation models used for the study had a consistent ranking. However, the setting had a limited impact on the models' performance, having a bigger impact on the perception of 'true movement', which had higher ratings in VR than in 2D.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
dyadic interaction, embodied conversational agents, evaluation paradigms, gesture generation, virtual reality
National Category
Human Computer Interaction; General Language Studies and Linguistics; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-357898 (URN); 10.1145/3686215.3688821 (DOI); 001429038200033; 2-s2.0-85211184881 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024
Note

Part of ISBN 9798400704635

QC 20250114

Available from: 2024-12-19 Created: 2024-12-19 Last updated: 2025-03-24. Bibliographically approved
Deichler, A., Alexanderson, S. & Beskow, J. (2024). Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents. In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024: . Paper presented at 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024. Association for Computing Machinery (ACM), Article ID 42.
Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
2024 (English) In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, Association for Computing Machinery (ACM), 2024, article id 42. Conference paper, Published paper (Refereed)
Abstract [en]

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Co-speech gesture, Deictic gestures, Gesture generation, Situated virtual agents, Synthetic data
National Category
Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-359256 (URN); 10.1145/3652988.3673936 (DOI); 001441957400042; 2-s2.0-85215524347 (Scopus ID)
Conference
24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024
Note

Part of ISBN 9798400706257

QC 20250203

Available from: 2025-01-29 Created: 2025-01-29 Last updated: 2025-04-30. Bibliographically approved
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings: . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
Matcha-TTS: A fast TTS architecture with conditional flow matching
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
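
To illustrate the OT-CFM objective mentioned above, here is a hedged PyTorch sketch of one training-loss computation and a few-step Euler sampler for the resulting ODE decoder. The decoder interface, sigma_min value, and step count are assumptions made for the example, not the Matcha-TTS code.

    import torch

    def ot_cfm_loss(decoder, x1, cond, sigma_min=1e-4):
        """x1: target acoustic frames (batch, mels, frames); cond: text conditioning."""
        t = torch.rand(x1.shape[0], 1, 1, device=x1.device)   # t ~ U(0, 1)
        x0 = torch.randn_like(x1)                             # noise endpoint
        xt = (1 - (1 - sigma_min) * t) * x0 + t * x1          # OT interpolation path
        target = x1 - (1 - sigma_min) * x0                    # conditional vector field
        pred = decoder(xt, t.flatten(), cond)                 # network predicts the field
        return ((pred - target) ** 2).mean()

    @torch.no_grad()
    def synthesise(decoder, cond, shape, n_steps=10):
        """Euler ODE solve from noise to acoustics in a handful of steps."""
        x = torch.randn(shape)
        for i in range(n_steps):
            t = torch.full((shape[0],), i / n_steps)
            x = x + decoder(x, t, cond) / n_steps
        return x

    # Smoke test with a placeholder network standing in for the trained decoder.
    dummy_decoder = lambda x, t, cond: torch.zeros_like(x)
    loss = ot_cfm_loss(dummy_decoder, torch.randn(4, 80, 200), cond=None)
    mel = synthesise(dummy_decoder, cond=None, shape=(1, 80, 200))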

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN); 10.1109/ICASSP48485.2024.10448291 (DOI); 001396233804117; 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-03-26. Bibliographically approved
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024: . Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
Prosodic characteristics of English-accented Swedish neural TTS
2024 (English) In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts; General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN); 10.21437/SpeechProsody.2024-209 (DOI)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova (2018-02427)
Note

QC 20240705

Available from: 2024-07-03 Created: 2024-07-03 Last updated: 2024-07-05. Bibliographically approved
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
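
The sketch below contrasts the two duration strategies the paper compares: a deterministic regressor that returns one fixed duration per symbol, and a stochastic sampler that Euler-integrates a learned flow-matching vector field from noise so repeated calls yield different but plausible timings. The log-domain parameterisation, network interfaces, and step count are assumptions made for the illustration.

    import torch

    def deterministic_durations(duration_net, text_enc):
        """Regression-style: identical timings on every call."""
        return torch.exp(duration_net(text_enc)).round().clamp(min=1)

    @torch.no_grad()
    def sampled_durations(vector_field, text_enc, n_steps=10):
        """Probabilistic: sample log-durations by solving the learned ODE from noise."""
        d = torch.randn(text_enc.shape[0], text_enc.shape[1])   # one value per symbol
        for i in range(n_steps):
            t = torch.full((text_enc.shape[0],), i / n_steps)
            d = d + vector_field(d, t, text_enc) / n_steps
        return torch.exp(d).round().clamp(min=1)

    # Placeholder networks so the sketch runs end to end.
    text_enc = torch.randn(2, 17, 192)                 # (batch, symbols, channels)
    dur_net = lambda enc: enc.mean(dim=-1)             # stand-in regressor
    field = lambda d, t, enc: torch.zeros_like(d)      # stand-in vector field
    fixed = deterministic_durations(dur_net, text_enc)
    varied = sampled_durations(field, text_enc)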

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358878 (URN); 10.21437/Interspeech.2024-1582 (DOI); 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-02-25. Bibliographically approved
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024): . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Unified speech and gesture synthesis using flow matching
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 8220-8224. Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
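
A small sketch of the "one process for both modalities" idea: a single decoder pass produces one frame sequence whose channels are then split into an acoustic stream and a gesture stream. The channel counts below are illustrative, not those used in the paper.

    import torch

    N_MEL, N_POSE = 80, 45                               # assumed channel counts
    joint_frames = torch.randn(1, 500, N_MEL + N_POSE)   # output of one joint decoder pass
    mel, gesture = joint_frames.split([N_MEL, N_POSE], dim=-1)
    # mel would feed a vocoder; gesture would drive a 3D skeleton.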

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN); 10.1109/ICASSP48485.2024.10445998 (DOI); 001396233801103; 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02 Created: 2025-04-02 Last updated: 2025-04-09. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604
