Publications (10 of 182)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings. Paper presented at 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024 (pp. 290-297). Springer Nature
2025 (English). In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, pp. 290-297. Conference paper, published (peer-reviewed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.
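For illustration, a minimal sketch of how operator annotations could be folded into a chat-style dialogue context alongside user turns. All names here are hypothetical; the paper itself routes this context through GPT-4o, which is not called in this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueContext:
    """Rolling message list of the kind consumed by chat-style LLMs."""
    messages: list = field(default_factory=list)

    def add_user_turn(self, text: str) -> None:
        self.messages.append({"role": "user", "content": text})

    def add_operator_note(self, note: str) -> None:
        # Operator observations (perceived emotion, detected actions,
        # topic-shift instructions) are injected as system-level context,
        # steering the next response without appearing as a spoken turn.
        self.messages.append({"role": "system", "content": f"[operator] {note}"})

ctx = DialogueContext()
ctx.add_user_turn("I give up, this puzzle is impossible.")
ctx.add_operator_note("user looks frustrated; suggest an easier sub-task")
```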

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
National subject category
Language Processing and Computational Linguistics; Human-Computer Interaction (Interaction Design); Computer Sciences
Identifiers
urn:nbn:se:kth:diva-362497 (URN) · 10.1007/978-981-96-3519-1_27 (DOI) · 1531735400027 () · 2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-11-28. Bibliographically reviewed.
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 1638-1642. Conference paper, published (peer-reviewed)
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech synthesis (TTS) capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-372797 (URN) · 10.21437/Interspeech.2025-762 (DOI) · 2-s2.0-105020040227 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically reviewed.
Moell, B., Farestam, F. & Beskow, J. (2025). Swedish Medical LLM Benchmark: Development and evaluation of a framework for assessing large language models in the Swedish medical domain. Frontiers in Artificial Intelligence, 8, Article ID 1557920.
2025 (English). In: Frontiers in Artificial Intelligence, E-ISSN 2624-8212, Vol. 8, article id 1557920. Journal article (peer-reviewed). Published
Abstract [en]

Introduction: We present the Swedish Medical LLM Benchmark (SMLB), an evaluation framework for assessing large language models (LLMs) in the Swedish medical domain.

Method: The SMLB addresses the lack of language-specific, clinically relevant benchmarks by incorporating four datasets: translated PubMedQA questions, Swedish Medical Exams, Emergency Medicine scenarios, and General Medicine cases.

Result: Our evaluation of 18 state-of-the-art LLMs reveals GPT-4-turbo, Claude-3.5 (October 2023), and the o3 model as top performers, demonstrating a strong alignment between medical reasoning and general language understanding capabilities. Hybrid systems incorporating retrieval-augmented generation (RAG) improved accuracy for clinical knowledge questions, highlighting promising directions for safe implementation.

Discussion: The SMLB provides not only an evaluation tool but also reveals fundamental insights about LLM capabilities and limitations in Swedish healthcare applications, including significant performance variations between models. By open-sourcing the benchmark, we enable transparent assessment of medical LLMs while promoting responsible development through community-driven refinement. This study emphasizes the critical need for rigorous evaluation frameworks as LLMs become increasingly integrated into clinical workflows, particularly in non-English medical contexts where linguistic and cultural specificity are paramount.

 

Place, publisher, year, edition, pages
Frontiers Media SA, 2025
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-371731 (URN) · 10.3389/frai.2025.1557920 (DOI) · 001536176500001 () · 40718621 (PubMedID) · 2-s2.0-105011480129 (Scopus ID)
Note

QC 20251019

Available from: 2025-10-17. Created: 2025-10-17. Last updated: 2025-11-13. Bibliographically reviewed.
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2165-2169). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2165-2169. Conference paper, published (peer-reviewed)
Abstract [en]

We present an investigation into adaptable speech synthesis for noisy environments. Leveraging a zero-shot TTS, we synthesized a corpus of 1,200 speech samples from 100 sentences of varying complexity, each generated at six distinct levels of vocal effort. To simulate realistic listening conditions, the synthesized speech is merged with environmental noise recordings from a diverse range of indoor and transportation settings at nine different signal-to-noise ratios (SNRs). We assess the intelligibility of the resulting noisy speech using ASR word error rates (WER) across conditions. Additionally, the input text was evaluated using four metrics of sentence complexity and word predictability. A number of regression models that used noise type, SNR, vocal effort and text as input were trained to predict ASR WER. Results show that increased vocal effort improves intelligibility, with benefits of up to 30% in adverse conditions, most pronounced in environments with competing speech at low SNRs.
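The noise-mixing step described in the abstract can be sketched as follows. This is a minimal sketch with synthetic signals; the function and variable names are our own, not from the paper, but the gain derivation from a target SNR is standard.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add it."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Target noise power is p_speech / 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s of a 220 Hz tone
noise = rng.standard_normal(16000)                           # white noise stand-in
noisy = mix_at_snr(speech, noise, snr_db=0.0)                # mix at 0 dB SNR
```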

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
noisy environments, speech adaptation, speech intelligibility, speech synthesis
National subject category
Language Processing and Computational Linguistics; Signal Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-372805 (URN) · 10.21437/Interspeech.2025-2787 (DOI) · 2-s2.0-105020064005 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13. Created: 2025-11-13. Last updated: 2025-11-13. Bibliographically reviewed.
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, published (peer-reviewed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
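The idea of conditioning on continuous positions along phonological feature axes can be illustrated with a toy interpolation. Only two axes and two phonemes are shown, and the particular values are invented for illustration; the paper uses 11 features for US English.

```python
import numpy as np

# Toy phoneme-to-feature table: each phoneme is a point in [0, 1]^k on
# axes such as voicing or vowel height (values here are illustrative).
features = {
    "s": np.array([0.0, 0.2]),  # [voicing, height]
    "z": np.array([1.0, 0.2]),
}

def interpolate(a: str, b: str, t: float) -> np.ndarray:
    """Continuous position between two phonemes along all feature axes."""
    return (1 - t) * features[a] + t * features[b]

# Halfway between /s/ and /z/: only the voicing axis moves, height stays put.
mid = interpolate("s", "z", 0.5)
```

A TTS conditioned on such vectors can then be driven with inputs that lie between canonical phoneme positions, which is what enables the categorical-perception experiment described above.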

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National subject category
Language Processing and Computational Linguistics; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN) · 10.21437/Interspeech.2024-1565 (DOI) · 1331850102192 () · 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-11-28. Bibliographically reviewed.
Malmberg, F., Klezovich, A., Mesch, J. & Beskow, J. (2024). Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data. In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024. Paper presented at 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024 (pp. 219-224). Association for Computational Linguistics (ACL)
2024 (English). In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Association for Computational Linguistics (ACL), 2024, pp. 219-224. Conference paper, published (peer-reviewed)
Abstract [en]

Unsupervised representation learning offers a promising way of utilising large unannotated sign language resources found on the Internet. In this paper, a representation learning model, VQ-VAE, is trained to learn a codebook of motion primitives from sign language data. For training, we use isolated signs and sentences from a sign language dictionary. Three models are trained: one on isolated signs, one on sentences, and one mixed model. We test these models by comparing how well they are able to reconstruct held-out data from the dictionary, as well as an in-the-wild dataset based on sign language videos from YouTube. These data are characterized by less formal and more expressive signing than the dictionary items. Results show that the isolated sign model yields considerably higher reconstruction loss for the YouTube dataset, while the sentence model performs the best on this data. Further, an analysis of codebook usage reveals that the set of codes used by isolated signs and sentences differ significantly. In order to further understand the different characters of the datasets, we carry out an analysis of the velocity profiles, which reveals that signing data in-the-wild has a much higher average velocity than dictionary signs and sentences. We believe these differences also explain the large differences in reconstruction loss observed.
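The codebook-usage analysis mentioned above rests on the basic vector-quantization assignment step: each motion frame is mapped to its nearest codebook vector. A toy sketch follows; the names, dimensions, and hand-built codebook are illustrative only, not taken from the paper's VQ-VAE.

```python
import numpy as np

def quantize(frames: np.ndarray, codebook: np.ndarray):
    """Assign each pose frame to its nearest codebook vector (VQ step).

    frames: (T, D) motion features; codebook: (K, D) learned codes.
    Returns code indices and the reconstruction from the codebook.
    """
    # Pairwise squared distances (T, K), then argmin over codes.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

# Hand-built K=8, D=4 codebook and three frames lying near codes 0, 0, 3.
codebook = np.array([[1., 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
                     [1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 1]])
frames = codebook[[0, 0, 3]] + 0.01
idx, recon = quantize(frames, codebook)
usage = np.bincount(idx, minlength=len(codebook))  # per-code usage counts
```

Comparing such usage histograms across datasets is one way to see that isolated signs and continuous sentences occupy different parts of the codebook.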

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
Keywords
Pose Codebook, Representation Learning, sign language data, VQ-VAE
National subject category
Comparative Linguistics and General Linguistics
Identifiers
urn:nbn:se:kth:diva-350726 (URN) · 2-s2.0-85197480349 (Scopus ID)
Conference
11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024
Project
signbot
Note

Part of ISBN 9782493814302

QC 20240719

Available from: 2024-07-17. Created: 2024-07-17. Last updated: 2024-10-23. Bibliographically reviewed.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, published (peer-reviewed)
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Project
bodytalk
Note

QC 20241022

Available from: 2024-10-22. Created: 2024-10-22. Last updated: 2024-10-22. Bibliographically reviewed.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024. Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 1952-1964. Conference paper, published (peer-reviewed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National subject category
Signal Processing; Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-367174 (URN) · 10.1109/CVPRW63382.2024.00201 (DOI) · 001327781702011 () · 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15. Created: 2025-07-15. Last updated: 2025-08-13. Bibliographically reviewed.
Werner, A. W., Beskow, J. & Deichler, A. (2024). Gesture Evaluation in Virtual Reality. In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction. Paper presented at 26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024 (pp. 156-164). Association for Computing Machinery (ACM)
2024 (English). In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2024, pp. 156-164. Conference paper, published (peer-reviewed)
Abstract [en]

Gestures play a crucial role in human communication, enhancing interpersonal interactions through non-verbal expression. Advances in technology allow virtual avatars to use AI-generated communicative gestures to enhance their life-likeness and communication quality. Traditionally, evaluations of AI-generated gestures have been confined to 2D settings. However, Virtual Reality (VR) offers an immersive alternative with the potential to affect the perception of virtual gestures. This paper introduces a novel evaluation approach for computer-generated gestures, investigating the impact of a fully immersive environment compared to a traditional 2D setting. The goal is to identify the differences, benefits, and drawbacks of the two alternatives. Furthermore, the study evaluates three gesture generation algorithms submitted to the 2023 GENEA Challenge in the two virtual settings. Experiments showed that the VR setting has an impact on the rating of generated gestures: participants tended to rate gestures observed in VR slightly higher on average than in 2D. The generation models maintained a consistent ranking across settings, so the setting had a limited impact on the models' relative performance; it had a bigger impact on the perception of true movement, which received higher ratings in VR than in 2D.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
dyadic interaction, embodied conversational agents, evaluation paradigms, gesture generation, virtual reality
National subject category
Human-Computer Interaction (Interaction Design); Comparative Linguistics and General Linguistics; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-357898 (URN) · 10.1145/3686215.3688821 (DOI) · 001429038200033 () · 2-s2.0-85211184881 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024
Note

Part of ISBN 9798400704635

QC 20250114

Available from: 2024-12-19. Created: 2024-12-19. Last updated: 2025-03-24. Bibliographically reviewed.
Deichler, A., Alexanderson, S. & Beskow, J. (2024). Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents. In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024. Paper presented at 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom, September 16-19, 2024. Association for Computing Machinery (ACM), Article ID 42.
2024 (English). In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, Association for Computing Machinery (ACM), 2024, article id 42. Conference paper, published (peer-reviewed)
Abstract [en]

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Co-speech gesture, Deictic gestures, Gesture generation, Situated virtual agents, Synthetic data
National subject category
Human-Computer Interaction (Interaction Design); Computer Sciences
Identifiers
urn:nbn:se:kth:diva-359256 (URN) · 10.1145/3652988.3673936 (DOI) · 001441957400042 () · 2-s2.0-85215524347 (Scopus ID)
Conference
24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom, September 16-19, 2024
Note

Part of ISBN 9798400706257

QC 20250203

Available from: 2025-01-29. Created: 2025-01-29. Last updated: 2025-04-30. Bibliographically reviewed.
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604