Publications (10 of 180)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings. Paper presented at 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024 (pp. 290-297). Springer Nature
2025 (English) In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, p. 290-297. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.
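As a rough illustration of the operator-in-the-loop idea described above, the sketch below shows one way operator observations (emotion, action, topic shifts) could be injected into an LLM dialogue context before each robot response. The message format and helper function are assumptions made for illustration, not the paper's implementation; only the standard OpenAI chat-completions client call is taken as given.

# Hypothetical sketch: fold human-operator annotations into the dialogue context
# of an LLM-driven social robot. Message format and helper are illustrative.
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system", "content": "You are a social robot in a task-oriented dialogue."},
    {"role": "user", "content": "I'm not sure which piece goes next."},
]

def add_operator_annotation(messages, annotation):
    """Inject an operator observation (emotion, action, topic shift) as context."""
    messages.append({"role": "system", "content": f"[operator] {annotation}"})

add_operator_annotation(messages, "User looks frustrated; slow down and offer encouragement.")

reply = client.chat.completions.create(model="gpt-4o", messages=messages)
print(reply.choices[0].message.content)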

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
National Category
Natural Language Processing; Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-362497 (URN); 10.1007/978-981-96-3519-1_27 (DOI); 2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16 Created: 2025-04-16 Last updated: 2025-04-24. Bibliographically approved
Moell, B., Farestam, F. & Beskow, J. (2025). Swedish Medical LLM Benchmark: Development and evaluation of a framework for assessing large language models in the Swedish medical domain. Frontiers in Artificial Intelligence, 8, Article ID 1557920.
2025 (English) In: Frontiers in Artificial Intelligence, E-ISSN 2624-8212, Vol. 8, article id 1557920. Article in journal (Refereed), Published
Abstract [en]

Introduction: We present the Swedish Medical LLM Benchmark (SMLB), an evaluation framework for assessing large language models (LLMs) in the Swedish medical domain.

Method: The SMLB addresses the lack of language-specific, clinically relevant benchmarks by incorporating four datasets: translated PubMedQA questions, Swedish Medical Exams, Emergency Medicine scenarios, and General Medicine cases.

Result: Our evaluation of 18 state-of-the-art LLMs reveals GPT-4-turbo, Claude 3.5 (October 2023), and the o3 model as top performers, demonstrating a strong alignment between medical reasoning and general language understanding capabilities. Hybrid systems incorporating retrieval-augmented generation (RAG) improved accuracy for clinical knowledge questions, highlighting promising directions for safe implementation.

Discussion: The SMLB provides not only an evaluation tool but also reveals fundamental insights about LLM capabilities and limitations in Swedish healthcare applications, including significant performance variations between models. By open-sourcing the benchmark, we enable transparent assessment of medical LLMs while promoting responsible development through community-driven refinement. This study emphasizes the critical need for rigorous evaluation frameworks as LLMs become increasingly integrated into clinical workflows, particularly in non-English medical contexts where linguistic and cultural specificity are paramount.
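To make the benchmarking idea concrete, here is a minimal accuracy-scoring loop over multiple-choice items. The item format, the toy questions, and the answer() callable are assumptions for illustration only, not the SMLB's actual interface or data.

# Illustrative accuracy scoring over multiple-choice medical items (toy data).
items = [
    {"question": "Vilket vitamin ges vid D-vitaminbrist?",
     "options": ["Vitamin D", "Vitamin C", "Vitamin B12"], "correct": 0},
    {"question": "Vad mäter HbA1c?",
     "options": ["Njurfunktion", "Långtidsblodsocker", "Blodtryck"], "correct": 1},
]

def evaluate(items, answer):
    """answer: callable mapping (question, options) -> chosen option index."""
    hits = sum(answer(it["question"], it["options"]) == it["correct"] for it in items)
    return hits / len(items)

baseline = lambda q, opts: 0   # trivial baseline: always pick the first option
print(f"accuracy: {evaluate(items, baseline):.2%}")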

 

Place, publisher, year, edition, pages
Frontiers Media SA, 2025
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-371731 (URN); 10.3389/frai.2025.1557920 (DOI); 001536176500001 (); 40718621 (PubMedID); 2-s2.0-105011480129 (Scopus ID)
Note

QC 20251019

Available from: 2025-10-17 Created: 2025-10-17 Last updated: 2025-10-19. Bibliographically approved
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
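A minimal sketch of the conditioning idea: each phoneme is placed at a fixed position on each feature axis in [0, 1], and individual axes can be shifted to probe categorical perception. The feature set and values below are made-up illustrations, not the 11 US-English features used in the paper.

# Sketch: map phonemes to continuous phonological feature vectors in [0, 1]^N.
import numpy as np

FEATURES = ["voiced", "nasal", "high", "back", "rounded"]   # assumed toy subset

PHONEME_FEATURES = {
    "p": [0.0, 0.0, 0.0, 0.0, 0.0],
    "b": [1.0, 0.0, 0.0, 0.0, 0.0],
    "m": [1.0, 1.0, 0.0, 0.0, 0.0],
    "u": [1.0, 0.0, 1.0, 1.0, 1.0],
}

def encode(phonemes, interpolate=None):
    """Return a (T, N) matrix of feature values; optionally set one axis to an
    intermediate value, as in a categorical-perception experiment."""
    X = np.array([PHONEME_FEATURES[p] for p in phonemes], dtype=np.float32)
    if interpolate:
        axis, value = interpolate
        X[:, FEATURES.index(axis)] = value
    return X

print(encode(["b", "u"], interpolate=("voiced", 0.5)))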

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN); 10.21437/Interspeech.2024-1565 (DOI); 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-28. Bibliographically approved
Malmberg, F., Klezovich, A., Mesch, J. & Beskow, J. (2024). Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data. In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024. Paper presented at 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024 (pp. 219-224). Association for Computational Linguistics (ACL)
2024 (English) In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Association for Computational Linguistics (ACL), 2024, p. 219-224. Conference paper, Published paper (Refereed)
Abstract [en]

Unsupervised representation learning offers a promising way of utilising large unannotated sign language resources found on the Internet. In this paper, a representation learning model, VQ-VAE, is trained to learn a codebook of motion primitives from sign language data. For training, we use isolated signs and sentences from a sign language dictionary. Three models are trained: one on isolated signs, one on sentences, and one mixed model. We test these models by comparing how well they are able to reconstruct held-out data from the dictionary, as well as an in-the-wild dataset based on sign language videos from YouTube. These data are characterized by less formal and more expressive signing than the dictionary items. Results show that the isolated sign model yields considerably higher reconstruction loss for the YouTube dataset, while the sentence model performs the best on this data. Further, an analysis of codebook usage reveals that the set of codes used by isolated signs and sentences differs significantly. To further understand the differing character of the datasets, we carry out an analysis of the velocity profiles, which reveals that in-the-wild signing data has a much higher average velocity than dictionary signs and sentences. We believe these differences also explain the large differences in reconstruction loss observed.
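For readers unfamiliar with VQ-VAEs, the PyTorch sketch below shows the core vector-quantisation step that snaps encoded motion frames to their nearest codebook entries; codebook usage can then be analysed from the resulting code indices. The codebook size and dimensionality are illustrative assumptions, not the paper's configuration.

# Sketch of the vector-quantisation step at the core of a VQ-VAE.
import torch

codebook = torch.randn(512, 64)           # 512 motion-primitive codes, 64-dim (assumed)

def quantize(z):
    """z: (T, 64) encoder outputs -> (codes, z_q) with straight-through gradient."""
    d = torch.cdist(z, codebook)           # (T, 512) distances to every code
    codes = d.argmin(dim=1)                # nearest code index per frame
    z_q = codebook[codes]
    z_q = z + (z_q - z).detach()           # straight-through estimator
    return codes, z_q

z = torch.randn(100, 64)                   # e.g. 100 encoded pose frames
codes, z_q = quantize(z)
print(codes[:10], z_q.shape)               # `codes` feeds the codebook-usage analysis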

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
Keywords
Pose Codebook, Representation Learning, sign language data, VQ-VAE
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-350726 (URN); 2-s2.0-85197480349 (Scopus ID)
Conference
11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024
Projects
signbot
Note

Part of ISBN 9782493814302

QC 20240719

Available from: 2024-07-17 Created: 2024-07-17 Last updated: 2024-10-23. Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
2024 (English) In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, p. 1952-1964. Conference paper, Published paper (Refereed)
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22. Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024. Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1952-1964. Conference paper, Published paper (Refereed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
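A hedged sketch of the data-generation recipe: unimodal models trained on large corpora synthesise parallel text-audio-motion triples, which are then used to pre-train the joint model before fine-tuning on real parallel data. The classes and methods below are stand-ins for illustration, not the authors' released code.

# Sketch of "fake it to make it": build synthetic parallel data with unimodal models.
class DummyTTS:
    def synthesise(self, text):
        return [0.0] * (16 * len(text))                 # fake waveform samples

class DummyGestureModel:
    def generate(self, audio):
        return [[0.0, 0.0, 0.0] for _ in range(len(audio) // 16)]  # fake pose frames

def build_synthetic_corpus(texts, tts, gesture):
    """Yield (text, audio, motion) triples for pre-training a joint speech-and-gesture model."""
    for text in texts:
        audio = tts.synthesise(text)                    # uni-modal text-to-speech
        motion = gesture.generate(audio)                # uni-modal speech-to-gesture
        yield {"text": text, "audio": audio, "motion": motion}

corpus = list(build_synthetic_corpus(["hello there"], DummyTTS(), DummyGestureModel()))
print(len(corpus[0]["audio"]), len(corpus[0]["motion"]))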

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National Category
Signal Processing; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367174 (URN); 10.1109/CVPRW63382.2024.00201 (DOI); 001327781702011 (); 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15 Created: 2025-07-15 Last updated: 2025-08-13. Bibliographically approved
Werner, A. W., Beskow, J. & Deichler, A. (2024). Gesture Evaluation in Virtual Reality. In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction. Paper presented at 26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024 (pp. 156-164). Association for Computing Machinery (ACM)
2024 (English) In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2024, p. 156-164. Conference paper, Published paper (Refereed)
Abstract [en]

Gestures play a crucial role in human communication, enhancing interpersonal interactions through non-verbal expression. Emerging technology allows virtual avatars to enhance their life-likeness and communication quality with AI-generated communicative gestures. Traditionally, evaluations of AI-generated gestures have been confined to 2D settings. However, Virtual Reality (VR) offers an immersive alternative with the potential to affect the perception of virtual gestures. This paper introduces a novel evaluation approach for computer-generated gestures, investigating the impact of a fully immersive environment compared to a traditional 2D setting. The goal is to identify the differences, benefits, and drawbacks of the two alternatives. The study also investigates three gesture generation algorithms submitted to the 2023 GENEA Challenge and evaluates their performance in the two virtual settings. Experiments showed that the VR setting has an impact on the rating of generated gestures: participants tended to rate gestures observed in VR slightly higher on average than in 2D. The generation models retained a consistent ranking across settings, so the setting had a limited impact on the models' relative performance; its larger effect was on the perception of 'true movement', which received higher ratings in VR than in 2D.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
dyadic interaction, embodied conversational agents, evaluation paradigms, gesture generation, virtual reality
National Category
Human Computer Interaction; General Language Studies and Linguistics; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-357898 (URN); 10.1145/3686215.3688821 (DOI); 001429038200033 (); 2-s2.0-85211184881 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024
Note

Part of ISBN 9798400704635

QC 20250114

Available from: 2024-12-19 Created: 2024-12-19 Last updated: 2025-03-24. Bibliographically approved
Deichler, A., Alexanderson, S. & Beskow, J. (2024). Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents. In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024. Paper presented at 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024. Association for Computing Machinery (ACM), Article ID 42.
2024 (English) In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, Association for Computing Machinery (ACM), 2024, article id 42. Conference paper, Published paper (Refereed)
Abstract [en]

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Co-speech gesture, Deictic gestures, Gesture generation, Situated virtual agents, Synthetic data
National Category
Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-359256 (URN); 10.1145/3652988.3673936 (DOI); 001441957400042 (); 2-s2.0-85215524347 (Scopus ID)
Conference
24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024
Note

Part of ISBN 9798400706257

QC 20250203

Available from: 2025-01-29 Created: 2025-01-29 Last updated: 2025-04-30. Bibliographically approved
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
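For readers unfamiliar with conditional flow matching, the OT-CFM training objective behind this kind of decoder can be written, up to notational choices, roughly as

\mathcal{L}_{\text{OT-CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\left\| v_\theta\!\left(x_t,\, t \mid c\right) - \left(x_1 - (1-\sigma_{\min})\,x_0\right) \right\|^2, \qquad x_t = \left(1 - (1-\sigma_{\min})\,t\right) x_0 + t\, x_1,

where x_1 is the target acoustic-feature sequence, x_0 \sim \mathcal{N}(0, I), t \sim \mathcal{U}[0, 1], c is the text conditioning, and v_\theta is the decoder's learned vector field; synthesis integrates the resulting ODE from x_0 in a small number of steps. See the paper for the exact formulation and conditioning details.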

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN); 10.1109/ICASSP48485.2024.10448291 (DOI); 001396233804117 (); 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-08-13. Bibliographically approved
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
2024 (English) In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.
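The abstract does not specify how the listener rank was derived from the pairwise judgements; one standard option, shown purely as an illustration with made-up stimulus names, is a simple win-count ranking over the comparison outcomes:

# Illustrative win-count ranking from pairwise "which sounds more English-accented?" judgements.
from collections import Counter

# (winner, loser) pairs: each tuple records the stimulus judged more accented
judgements = [("acc_0.8", "acc_0.4"), ("acc_0.8", "acc_0.0"),
              ("acc_0.4", "acc_0.0"), ("acc_0.4", "acc_0.8")]

wins = Counter(w for w, _ in judgements)
ranking = sorted({s for pair in judgements for s in pair},
                 key=lambda s: wins[s], reverse=True)
print(ranking)   # most- to least-perceived English-accentedness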

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts; General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN); 10.21437/SpeechProsody.2024-209 (DOI); 2-s2.0-105008058763 (Scopus ID)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Funder
Vinnova (2018-02427)
Note

QC 20240705

Available from: 2024-07-03 Created: 2024-07-03 Last updated: 2025-07-01. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604
