KTH Publications (kth.se)
Publications (10 of 182)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings. Paper presented at 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024 (pp. 290-297). Springer Nature.
A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI
2025 (English). In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, p. 290-297. Conference paper, Published paper (Refereed).
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.
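The operator-in-the-loop mechanism the abstract describes can be sketched as a simple context builder that merges live operator annotations into the dialogue history before each model turn. The message schema and function name below are illustrative assumptions, not the framework's actual API:

```python
def build_context(history, operator_notes):
    """Merge live operator annotations (emotion labels, detected actions,
    topic-shift or behaviour instructions) into the dialogue context as
    system-level messages, so they influence the LLM's next response."""
    context = [{"role": turn["role"], "content": turn["text"]} for turn in history]
    for note in operator_notes:
        context.append({"role": "system", "content": f"[operator] {note}"})
    return context
```

In a setup like the one described, the returned list would be passed as the message history of a chat-style LLM call, with the operator's observations arriving asynchronously from the control interface.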

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
National Category
Natural Language Processing; Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-362497 (URN), 10.1007/978-981-96-3519-1_27 (DOI), 001531735400027 (), 2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-12-08. Bibliographically approved.
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association.
Intrasentential English in Swedish TTS: perceived English-accentedness
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 1638-1642. Conference paper, Published paper (Refereed).
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech (TTS) system capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.
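As an illustration of the Best-Worst listening tests mentioned above, a standard way to score such judgments is to count how often each item is picked as best versus worst. This is a generic Best-Worst scaling sketch, not the paper's exact analysis:

```python
from collections import Counter

def best_worst_scores(trials):
    """Score items from Best-Worst trials. Each trial is a tuple
    (items_shown, best_pick, worst_pick). An item's score is
    (#times best - #times worst) / #times shown, ranging over [-1, 1]."""
    best, worst, seen = Counter(), Counter(), Counter()
    for items, b, w in trials:
        for item in items:
            seen[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}
```

Items here would be the same English insertion rendered at different PEA levels; the resulting scores give a preference ordering per insertion context.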

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372797 (URN), 10.21437/Interspeech.2025-762 (DOI), 2-s2.0-105020040227 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically approved.
Moell, B., Farestam, F. & Beskow, J. (2025). Swedish Medical LLM Benchmark: Development and evaluation of a framework for assessing large language models in the Swedish medical domain. Frontiers in Artificial Intelligence, 8, Article ID 1557920.
Swedish Medical LLM Benchmark: Development and evaluation of a framework for assessing large language models in the Swedish medical domain
2025 (English). In: Frontiers in Artificial Intelligence, E-ISSN 2624-8212, Vol. 8, article id 1557920. Article in journal (Refereed). Published.
Abstract [en]

Introduction: We present the Swedish Medical LLM Benchmark (SMLB), an evaluation framework for assessing large language models (LLMs) in the Swedish medical domain.

Method: The SMLB addresses the lack of language-specific, clinically relevant benchmarks by incorporating four datasets: translated PubMedQA questions, Swedish Medical Exams, Emergency Medicine scenarios, and General Medicine cases.

Results: Our evaluation of 18 state-of-the-art LLMs reveals GPT-4-turbo, Claude-3.5 (October 2023), and the o3 model as top performers, demonstrating a strong alignment between medical reasoning and general language understanding capabilities. Hybrid systems incorporating retrieval-augmented generation (RAG) improved accuracy for clinical knowledge questions, highlighting promising directions for safe implementation.

Discussion: The SMLB provides not only an evaluation tool but also reveals fundamental insights about LLM capabilities and limitations in Swedish healthcare applications, including significant performance variations between models. By open-sourcing the benchmark, we enable transparent assessment of medical LLMs while promoting responsible development through community-driven refinement. This study emphasizes the critical need for rigorous evaluation frameworks as LLMs become increasingly integrated into clinical workflows, particularly in non-English medical contexts where linguistic and cultural specificity are paramount.


Place, publisher, year, edition, pages
Frontiers Media SA, 2025
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-371731 (URN), 10.3389/frai.2025.1557920 (DOI), 001536176500001 (), 40718621 (PubMedID), 2-s2.0-105011480129 (Scopus ID)
Note

QC 20251019

Available from: 2025-10-17. Created: 2025-10-17. Last updated: 2025-11-13. Bibliographically approved.
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2165-2169). International Speech Communication Association.
Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2165-2169. Conference paper, Published paper (Refereed).
Abstract [en]

We present an investigation into adaptable speech synthesis for noisy environments. Leveraging a zero-shot TTS, we synthesized a corpus of 1,200 speech samples from 100 sentences of varying complexity, each generated at six distinct levels of vocal effort. To simulate realistic listening conditions, the synthesized speech is merged with environmental noise recordings from a diverse range of indoor and transportation settings at nine different signal-to-noise ratios. We assess the intelligibility of the resulting noisy speech using ASR word error rates across conditions. Additionally, the input text was evaluated using four metrics of sentence complexity and word predictability. A number of regression models that used noise type, SNR, vocal effort and text as input were trained to predict ASR WER. Results show that increased vocal effort improves intelligibility, with benefits of up to 30% in adverse conditions, most pronounced in environments with competing speech at low SNRs.
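Mixing speech with noise at a target SNR, as described above, amounts to scaling the noise so the speech-to-noise power ratio matches the target before summing. A minimal pure-Python sketch (plain per-sample lists stand in for audio arrays; the paper's actual tooling is not specified):

```python
import math

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    snr_db, then add it to the speech sample by sample."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Required noise gain: SNR_dB = 10*log10(p_speech / (gain^2 * p_noise))
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

At 0 dB the scaled noise carries the same average power as the speech; the corpus described above would sweep this over nine SNR values per noise type.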

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
noisy environments, speech adaptation, speech intelligibility, speech synthesis
National Category
Natural Language Processing; Signal Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-372805 (URN), 10.21437/Interspeech.2025-2787 (DOI), 2-s2.0-105020064005 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13. Created: 2025-11-13. Last updated: 2025-11-13. Bibliographically approved.
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association.
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2815-2819. Conference paper, Published paper (Refereed).
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
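The input representation described above, where every phoneme occupies a position in [0.0, 1.0] on each feature axis, can be sketched as a lookup plus interpolation. The tiny two-axis feature table below is invented purely for illustration (the actual system uses 11 features for US English):

```python
# Hypothetical feature table: each phoneme gets a position in [0.0, 1.0]
# on each feature axis (here just two axes for brevity).
FEATURES = {
    "p": {"voicing": 0.0, "nasality": 0.0},
    "b": {"voicing": 1.0, "nasality": 0.0},
    "m": {"voicing": 1.0, "nasality": 1.0},
}
AXES = ["voicing", "nasality"]

def phonemes_to_features(phonemes):
    """Replace each phoneme symbol by its continuous feature vector,
    the form of input the TTS model is conditioned on."""
    return [[FEATURES[p][axis] for axis in AXES] for p in phonemes]

def interpolate(p1, p2, t):
    """A point t of the way from phoneme p1 to p2 on every feature axis:
    the kind of off-grid input a categorical perception test can probe."""
    return [(1 - t) * FEATURES[p1][axis] + t * FEATURES[p2][axis] for axis in AXES]
```

Feeding interpolated vectors to the model is what makes gradual, continuous control over phonological aspects possible, as opposed to discrete grapheme or phoneme symbols.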

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN), 10.21437/Interspeech.2024-1565 (DOI), 001331850102192 (), 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-08. Bibliographically approved.
Malmberg, F., Klezovich, A., Mesch, J. & Beskow, J. (2024). Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data. In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024. Paper presented at 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024 (pp. 219-224). Association for Computational Linguistics (ACL).
Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data
2024 (English). In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Association for Computational Linguistics (ACL), 2024, p. 219-224. Conference paper, Published paper (Refereed).
Abstract [en]

Unsupervised representation learning offers a promising way of utilising large unannotated sign language resources found on the Internet. In this paper, a representation learning model, VQ-VAE, is trained to learn a codebook of motion primitives from sign language data. For training, we use isolated signs and sentences from a sign language dictionary. Three models are trained: one on isolated signs, one on sentences, and one mixed model. We test these models by comparing how well they are able to reconstruct held-out data from the dictionary, as well as an in-the-wild dataset based on sign language videos from YouTube. These data are characterized by less formal and more expressive signing than the dictionary items. Results show that the isolated sign model yields considerably higher reconstruction loss for the YouTube dataset, while the sentence model performs the best on this data. Further, an analysis of codebook usage reveals that the set of codes used by isolated signs and sentences differ significantly. In order to further understand the different characters of the datasets, we carry out an analysis of the velocity profiles, which reveals that signing data in-the-wild has a much higher average velocity than dictionary signs and sentences. We believe these differences also explain the large differences in reconstruction loss observed.
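The velocity-profile analysis mentioned at the end of the abstract reduces to averaging frame-to-frame joint displacements over a pose sequence. A minimal sketch under assumed conventions (2D joint positions per frame; the paper's pose format and frame rate are not given here):

```python
import math

def average_velocity(poses, fps=25):
    """Mean frame-to-frame displacement across all joints, scaled by the
    frame rate to give position units per second. `poses` is a list of
    frames; each frame is a list of (x, y) joint positions."""
    total, steps = 0.0, 0
    for prev, cur in zip(poses, poses[1:]):
        for (x0, y0), (x1, y1) in zip(prev, cur):
            total += math.hypot(x1 - x0, y1 - y0)
            steps += 1
    return (total / steps) * fps if steps else 0.0
```

Comparing this statistic across the dictionary and YouTube datasets is one way to quantify the "much higher average velocity" of in-the-wild signing reported above.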

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
Keywords
Pose Codebook, Representation Learning, sign language data, VQ-VAE
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-350726 (URN), 2-s2.0-85197480349 (Scopus ID)
Conference
11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024
Projects
signbot
Note

Part of ISBN 9782493814302

QC 20240719

Available from: 2024-07-17. Created: 2024-07-17. Last updated: 2024-10-23. Bibliographically approved.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, p. 1952-1964. Conference paper, Published paper (Refereed).
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22. Created: 2024-10-22. Last updated: 2024-10-22. Bibliographically approved.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024. Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis
2024 (English). In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1952-1964. Conference paper, Published paper (Refereed).
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
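The data recipe described above, where unimodal models synthesize parallel training material for a joint model, can be sketched as a simple pipeline. `tts` and `gesture_model` below are placeholder callables standing in for the large unimodal models; the names and record format are illustrative, not the paper's actual components:

```python
def make_synthetic_parallel(texts, tts, gesture_model):
    """Create multi-modal (but synthetic) parallel training data:
    a text-to-speech model produces audio, and a speech-to-gesture
    model produces matching motion, yielding aligned triples that can
    pre-train a joint speech-and-gesture synthesis model."""
    data = []
    for text in texts:
        audio = tts(text)
        motion = gesture_model(audio)
        data.append({"text": text, "audio": audio, "motion": motion})
    return data
```

The joint model would then be pre-trained on this synthetic corpus and fine-tuned on the smaller genuine parallel data, in the student-teacher spirit the abstract invokes.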

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National Category
Signal Processing; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367174 (URN), 10.1109/CVPRW63382.2024.00201 (DOI), 001327781702011 (), 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15. Created: 2025-07-15. Last updated: 2025-08-13. Bibliographically approved.
Werner, A. W., Beskow, J. & Deichler, A. (2024). Gesture Evaluation in Virtual Reality. In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction. Paper presented at 26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024 (pp. 156-164). Association for Computing Machinery (ACM).
Gesture Evaluation in Virtual Reality
2024 (English). In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2024, p. 156-164. Conference paper, Published paper (Refereed).
Abstract [en]

Gestures play a crucial role in human communication, enhancing interpersonal interactions through non-verbal expression. Advances in technology allow virtual avatars to leverage AI-generated communicative gestures to enhance their life-likeness and communication quality. Traditionally, evaluations of AI-generated gestures have been confined to 2D settings. However, Virtual Reality (VR) offers an immersive alternative with the potential to affect the perception of virtual gestures. This paper introduces a novel evaluation approach for computer-generated gestures, investigating the impact of a fully immersive environment compared to a traditional 2D setting. The goal is to find the differences, benefits, and drawbacks of the two alternatives. Furthermore, the study also aims to investigate three gesture generation algorithms submitted to the 2023 GENEA Challenge and evaluate their performance in the two virtual settings. Experiments showed that the VR setting has an impact on the rating of generated gestures: participants tended to rate gestures observed in VR slightly higher on average than in 2D. The results also showed that the generation models used for the study had a consistent ranking across settings. The setting had a limited impact on the models' performance, having a bigger impact on the perception of 'true movement', which had higher ratings in VR than in 2D.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
dyadic interaction, embodied conversational agents, evaluation paradigms, gesture generation, virtual reality
National Category
Human Computer Interaction; General Language Studies and Linguistics; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-357898 (URN), 10.1145/3686215.3688821 (DOI), 001429038200033 (), 2-s2.0-85211184881 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024
Note

Part of ISBN 9798400704635

QC 20250114

Available from: 2024-12-19. Created: 2024-12-19. Last updated: 2025-03-24. Bibliographically approved.
Deichler, A., Alexanderson, S. & Beskow, J. (2024). Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents. In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024. Paper presented at 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024. Association for Computing Machinery (ACM), Article ID 42.
Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents
2024 (English). In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, Association for Computing Machinery (ACM), 2024, article id 42. Conference paper, Published paper (Refereed).
Abstract [en]

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Co-speech gesture, Deictic gestures, Gesture generation, Situated virtual agents, Synthetic data
National Category
Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-359256 (URN), 10.1145/3652988.3673936 (DOI), 001441957400042 (), 2-s2.0-85215524347 (Scopus ID)
Conference
24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024
Note

Part of ISBN 9798400706257

QC 20250203

Available from: 2025-01-29. Created: 2025-01-29. Last updated: 2025-04-30. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604
