Publications (10 of 182)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings: . Paper presented at 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024 (pp. 290-297). Springer Nature
2025 (English). In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, pp. 290-297. Conference paper, published paper (refereed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.
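To make the control flow concrete, here is a minimal sketch (not the authors' implementation) of how an operator annotation might be folded into a GPT-4o dialogue context, assuming the standard OpenAI chat-completions interface; the message contents and the add_operator_annotation helper are illustrative.

```python
# Minimal sketch (assumed, not the authors' system): injecting a human
# operator's annotation into the GPT-4o dialogue context before the
# robot's next turn, using the standard OpenAI chat-completions API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dialogue = [
    {"role": "system", "content": "You are a social robot in a task-oriented dialogue."},
    {"role": "user", "content": "I'm not sure which piece goes next..."},
]

def add_operator_annotation(dialogue, annotation):
    """Fold an operator observation (emotion, action, topic shift)
    into the context as a system-level note."""
    dialogue.append({"role": "system", "content": f"[operator] {annotation}"})

add_operator_annotation(dialogue, "User looks hesitant; offer encouragement and a hint.")

reply = client.chat.completions.create(model="gpt-4o", messages=dialogue)
print(reply.choices[0].message.content)
```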

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
HSV category
Identifiers
urn:nbn:se:kth:diva-362497 (URN), 10.1007/978-981-96-3519-1_27 (DOI), 001531735400027 (), 2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, Oct 23 2024 - Oct 26 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16 Created: 2025-04-16 Last updated: 2025-12-08. Bibliographically checked
Tånnander, C., House, D., Beskow, J. & Edlund, J. (2025). Intrasentential English in Swedish TTS: perceived English-accentedness. In: Interspeech 2025: . Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 1638-1642). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 1638-1642. Conference paper, published paper (refereed)
Abstract [en]

English names and expressions are frequently inserted into Swedish text. Humans intuitively adjust the degree of English pronunciation of such insertions. This work aims at a Swedish text-to-speech synthesis (TTS) capable of similar controlled adaptation. We focus on two key aspects: (1) the development of a TTS system with controllable degrees of perceived English-accentedness (PEA); and (2) the exploration of human preferences related to PEA. We trained a Swedish TTS voice on Swedish and English sentences with a conditioning parameter for language (English-accentedness, EA) on a scale from 0 to 1, and estimated a psychometric mapping of the perceived effect of EA to a perceptual scale (PEA) through perception tests. PEA was then used in Best-Worst listening tests presenting English insertions with varying PEA. The results confirm the effectiveness of the training and the PEA scale, and that listener preferences change with different insertions.
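As an illustration of the EA-to-PEA mapping step, the following sketch fits a logistic psychometric function with SciPy; the listener-response values are invented for the example and do not come from the paper.

```python
# Illustrative sketch (not the paper's procedure): fit a logistic
# psychometric function mapping the TTS conditioning value EA in [0, 1]
# to the proportion of listeners perceiving English accent (PEA).
import numpy as np
from scipy.optimize import curve_fit

ea = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])               # conditioning values
p_english = np.array([0.05, 0.12, 0.38, 0.71, 0.90, 0.97])  # made-up listener judgements

def logistic(x, x0, k):
    return 1.0 / (1.0 + np.exp(-k * (x - x0)))

(x0, k), _ = curve_fit(logistic, ea, p_english, p0=[0.5, 5.0])

# Invert the fit to pick EA values that hit evenly spaced PEA targets,
# e.g. for constructing Best-Worst listening-test stimuli.
targets = np.array([0.25, 0.5, 0.75])
ea_for_targets = x0 - np.log(1.0 / targets - 1.0) / k
print(dict(zip(targets.round(2), ea_for_targets.round(3))))
```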

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
controllable TTS, mixed language, read speech
HSV category
Identifiers
urn:nbn:se:kth:diva-372797 (URN), 10.21437/Interspeech.2025-762 (DOI), 2-s2.0-105020040227 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18 Created: 2025-11-18 Last updated: 2025-11-18. Bibliographically checked
Moell, B., Farestam, F. & Beskow, J. (2025). Swedish Medical LLM Benchmark: Development and evaluation of a framework for assessing large language models in the Swedish medical domain. Frontiers in Artificial Intelligence, 8, Article ID 1557920.
2025 (English). In: Frontiers in Artificial Intelligence, E-ISSN 2624-8212, Vol. 8, article id 1557920. Article in journal (refereed), published
Abstract [en]

Introduction: We present the Swedish Medical LLM Benchmark (SMLB), an evaluation framework for assessing large language models (LLMs) in the Swedish medical domain.

Method: The SMLB addresses the lack of language-specific, clinically relevant benchmarks by incorporating four datasets: translated PubMedQA questions, Swedish Medical Exams, Emergency Medicine scenarios, and General Medicine cases.

Result: Our evaluation of 18 state-of-the-art LLMs reveals GPT-4-turbo, Claude-3.5 (October 2023), and the o3 model as top performers, demonstrating a strong alignment between medical reasoning and general language understanding capabilities. Hybrid systems incorporating retrieval-augmented generation (RAG) improved accuracy for clinical knowledge questions, highlighting promising directions for safe implementation.

Discussion: The SMLB provides not only an evaluation tool but also reveals fundamental insights about LLM capabilities and limitations in Swedish healthcare applications, including significant performance variations between models. By open-sourcing the benchmark, we enable transparent assessment of medical LLMs while promoting responsible development through community-driven refinement. This study emphasizes the critical need for rigorous evaluation frameworks as LLMs become increasingly integrated into clinical workflows, particularly in non-English medical contexts where linguistic and cultural specificity are paramount.
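As a rough illustration of what such a benchmark evaluation loop involves, the sketch below computes per-dataset accuracy for a model callable; the dataset names, items, and ask_model interface are placeholders, not the released SMLB code.

```python
# Sketch of a benchmark-style accuracy evaluation; items and the
# ask_model() callable are illustrative placeholders, not the SMLB API.
from collections import defaultdict

datasets = {
    "pubmedqa_sv": [  # invented example item
        {"question": "Är metformin förstahandsval vid typ 2-diabetes?",
         "options": ["ja", "nej", "kanske"], "answer": "ja"},
    ],
    "medical_exams_sv": [],
    "emergency_medicine_sv": [],
    "general_medicine_sv": [],
}

def evaluate(ask_model):
    """ask_model(question, options) -> chosen option string."""
    scores = defaultdict(lambda: [0, 0])  # dataset -> [correct, total]
    for name, items in datasets.items():
        for item in items:
            pred = ask_model(item["question"], item["options"])
            scores[name][0] += pred.strip().lower() == item["answer"]
            scores[name][1] += 1
    return {k: c / t for k, (c, t) in scores.items() if t}

# Trivial baseline that always picks the first option:
print(evaluate(lambda q, opts: opts[0]))
```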

 

Place, publisher, year, edition, pages
Frontiers Media SA, 2025
HSV category
Identifiers
urn:nbn:se:kth:diva-371731 (URN), 10.3389/frai.2025.1557920 (DOI), 001536176500001 (), 40718621 (PubMedID), 2-s2.0-105011480129 (Scopus ID)
Note

QC 20251019

Available from: 2025-10-17 Created: 2025-10-17 Last updated: 2025-11-13. Bibliographically checked
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. In: Interspeech 2025: . Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2165-2169). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2165-2169. Conference paper, published paper (refereed)
Abstract [en]

We present an investigation into adaptable speech synthesis for noisy environments. Leveraging a zero-shot TTS, we synthesized a corpus of 1,200 speech samples from 100 sentences of varying complexity, each generated at six distinct levels of vocal effort. To simulate realistic listening conditions, the synthesized speech is merged with environmental noise recordings from a diverse range of indoor and transportation settings at nine different signal-to-noise ratios. We assess the intelligibility of the resulting noisy speech using ASR word error rates across conditions. Additionally, the input text was evaluated using four metrics on sentence complexity and word predictability. A number of regression models that used noise type, SNR, vocal effort and text as input were trained to predict ASR WER. Results show that increased vocal effort improves intelligibility, with benefits up to 30% in adverse conditions, most pronounced in environments with competing speech at low SNRs.
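The central corpus-construction step, mixing synthesized speech with environmental noise at a prescribed SNR, can be sketched as follows (an assumed implementation, not the authors' code):

```python
# Sketch: scale a noise recording so that the speech/noise power ratio
# matches a target SNR, then add it to the synthesized speech.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    noise = np.resize(noise, speech.shape)      # loop/trim noise to length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example with synthetic signals (1 s at 16 kHz):
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
speech = 0.1 * np.sin(2 * np.pi * 220 * t)
noise = np.random.default_rng(0).normal(0, 0.05, sr)
noisy = mix_at_snr(speech, noise, snr_db=0)
```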

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
noisy environments, speech adaptation, speech intelligibility, speech synthesis
HSV category
Identifiers
urn:nbn:se:kth:diva-372805 (URN), 10.21437/Interspeech.2025-2787 (DOI), 2-s2.0-105020064005 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13. Bibliographically checked
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, published paper (refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
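The input representation can be illustrated with a small sketch in which each phoneme is a point in a continuous feature space and stimuli are obtained by interpolation; the feature names and values below are invented for the example, not the paper's 11-feature specification.

```python
# Illustrative sketch: phonemes as points in a continuous phonological
# feature space on [0, 1], with interpolation between categories.
import numpy as np

FEATURES = ["voicing", "nasality", "frication", "height", "backness"]

PHONEME_FEATURES = {
    "s": np.array([0.0, 0.0, 1.0, 0.0, 0.0]),
    "z": np.array([1.0, 0.0, 1.0, 0.0, 0.0]),
    "i": np.array([1.0, 0.0, 0.0, 1.0, 0.0]),
}

def interpolate(ph_a, ph_b, alpha):
    """Feature vector alpha of the way from phoneme ph_a to ph_b."""
    return (1 - alpha) * PHONEME_FEATURES[ph_a] + alpha * PHONEME_FEATURES[ph_b]

# A stimulus halfway between /s/ and /z/ on the voicing axis,
# as in a categorical-perception style experiment:
print(dict(zip(FEATURES, interpolate("s", "z", 0.5))))
```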

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
HSV category
Identifiers
urn:nbn:se:kth:diva-358877 (URN), 10.21437/Interspeech.2024-1565 (DOI), 001331850102192 (), 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-08. Bibliographically checked
Malmberg, F., Klezovich, A., Mesch, J. & Beskow, J. (2024). Exploring Latent Sign Language Representations with Isolated Signs, Sentences and In-the-Wild Data. In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024: . Paper presented at 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024 (pp. 219-224). Association for Computational Linguistics (ACL)
2024 (English). In: 11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Association for Computational Linguistics (ACL), 2024, pp. 219-224. Conference paper, published paper (refereed)
Abstract [en]

Unsupervised representation learning offers a promising way of utilising large unannotated sign language resources found on the Internet. In this paper, a representation learning model, VQ-VAE, is trained to learn a codebook of motion primitives from sign language data. For training, we use isolated signs and sentences from a sign language dictionary. Three models are trained: one on isolated signs, one on sentences, and one mixed model. We test these models by comparing how well they are able to reconstruct held-out data from the dictionary, as well as an in-the-wild dataset based on sign language videos from YouTube. These data are characterized by less formal and more expressive signing than the dictionary items. Results show that the isolated sign model yields considerably higher reconstruction loss for the YouTube dataset, while the sentence model performs the best on this data. Further, an analysis of codebook usage reveals that the set of codes used by isolated signs and sentences differ significantly. In order to further understand the different characteristics of the datasets, we carry out an analysis of the velocity profiles, which reveals that signing data in-the-wild has a much higher average velocity than dictionary signs and sentences. We believe these differences also explain the large differences in reconstruction loss observed.
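For readers unfamiliar with VQ-VAEs, the quantization step that produces the codebook of motion primitives can be sketched as follows; the codebook size and latent dimensionality are arbitrary here, not those used in the paper.

```python
# Minimal sketch of the vector-quantization step at the heart of a VQ-VAE:
# each encoded pose/motion frame is snapped to its nearest codebook entry.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))   # 512 learned motion primitives
latents = rng.normal(size=(100, 64))    # encoder output for 100 frames

# Squared distances between every latent and every code: shape (100, 512)
d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d2.argmin(axis=1)               # index of nearest primitive per frame
quantized = codebook[codes]             # decoder input

# Codebook-usage analysis of the kind reported above:
usage = np.bincount(codes, minlength=len(codebook))
print("codes used:", (usage > 0).sum(), "of", len(codebook))
```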

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2024
Keywords
Pose Codebook, Representation Learning, sign language data, VQ-VAE
HSV category
Identifiers
urn:nbn:se:kth:diva-350726 (URN), 2-s2.0-85197480349 (Scopus ID)
Conference
11th Workshop on the Representation and Processing of Sign Languages: Evaluation of Sign Language Resources, sign-lang@LREC-COLING 2024, Torino, Italy, May 25 2024
Projects
signbot
Note

Part of ISBN 9782493814302

QC 20240719

Available from: 2024-07-17 Created: 2024-07-17 Last updated: 2024-10-23. Bibliographically checked
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: . Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, published paper (refereed)
HSV category
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22. Bibliographically checked
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024: . Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
2024 (English). In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 1952-1964. Conference paper, published paper (refereed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
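The core idea, using unimodal synthesis models to manufacture synthetic parallel training data, can be sketched at a high level as below; tts and speech_to_gesture are placeholder callables, not the authors' released models.

```python
# High-level sketch (assumed, not the authors' pipeline): unimodal models
# trained on large single-modality datasets create synthetic parallel
# (text, speech, gesture) triples for pre-training the joint model.
def synthesize_parallel_corpus(texts, tts, speech_to_gesture):
    corpus = []
    for text in texts:
        audio = tts(text)                  # unimodal text-to-speech
        motion = speech_to_gesture(audio)  # unimodal co-speech gesture model
        corpus.append({"text": text, "audio": audio, "motion": motion})
    return corpus

# Toy usage with stand-in callables; the joint model would be pre-trained
# on this synthetic corpus and fine-tuned on the smaller real parallel data.
demo = synthesize_parallel_corpus(
    ["Hello there."],
    tts=lambda text: f"<waveform for: {text}>",
    speech_to_gesture=lambda audio: f"<motion for: {audio}>",
)
print(demo)
```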

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
HSV category
Identifiers
urn:nbn:se:kth:diva-367174 (URN), 10.1109/CVPRW63382.2024.00201 (DOI), 001327781702011 (), 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15 Created: 2025-07-15 Last updated: 2025-08-13. Bibliographically checked
Werner, A. W., Beskow, J. & Deichler, A. (2024). Gesture Evaluation in Virtual Reality. In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction: . Paper presented at 26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024 (pp. 156-164). Association for Computing Machinery (ACM)
2024 (English). In: ICMI Companion 2024 - Companion Publication of the 26th International Conference on Multimodal Interaction, Association for Computing Machinery (ACM), 2024, pp. 156-164. Conference paper, published paper (refereed)
Abstract [en]

Gestures play a crucial role in human communication, enhancing interpersonal interactions through non-verbal expression. Emerging technology allows virtual avatars to leverage AI-generated communicative gestures to enhance their life-likeness and communication quality. Traditionally, evaluations of AI-generated gestures have been confined to 2D settings. However, Virtual Reality (VR) offers an immersive alternative with the potential to affect the perception of virtual gestures. This paper introduces a novel evaluation approach for computer-generated gestures, investigating the impact of a fully immersive environment compared to a traditional 2D setting. The goal is to identify the differences, benefits, and drawbacks of the two alternatives. The study also evaluates three gesture generation algorithms submitted to the 2023 GENEA Challenge in the two virtual settings. Experiments showed that the VR setting has an impact on the rating of generated gestures: participants tended to rate gestures observed in VR slightly higher on average than in 2D. The generation models kept a consistent ranking across settings, so the setting had a limited impact on their relative performance; its larger effect was on the perception of 'true movement', which received higher ratings in VR than in 2D.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
dyadic interaction, embodied conversational agents, evaluation paradigms, gesture generation, virtual reality
HSV category
Identifiers
urn:nbn:se:kth:diva-357898 (URN), 10.1145/3686215.3688821 (DOI), 001429038200033 (), 2-s2.0-85211184881 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction, ICMI Companion 2024, San Jose, Costa Rica, Nov 4 2024 - Nov 8 2024
Note

Part of ISBN 9798400704635

QC 20250114

Available from: 2024-12-19 Created: 2024-12-19 Last updated: 2025-03-24. Bibliographically checked
Deichler, A., Alexanderson, S. & Beskow, J. (2024). Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents. In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024: . Paper presented at 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024. Association for Computing Machinery (ACM), Article ID 42.
2024 (English). In: Proceedings of the 24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, Association for Computing Machinery (ACM), 2024, article id 42. Conference paper, published paper (refereed)
Abstract [en]

This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents’ non-verbal behaviors, specifically gestures. Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void. Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis. We introduce a novel synthetic gesture dataset tailored for this purpose. This development represents a critical step toward creating embodied conversational agents that interact more naturally with their environment and users.
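A scene-aware training example of the kind described might look like the following sketch; the field names and file formats are illustrative assumptions, not the released dataset's schema.

```python
# Sketch of a scene-aware gesture training example: the model is
# conditioned on speech and on object positions in the scene, enabling
# e.g. deictic (pointing) gestures. Field names are invented for the example.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SceneObject:
    name: str
    position: List[float]          # x, y, z in the agent's frame

@dataclass
class SpatialGestureSample:
    transcript: str
    audio_path: str
    motion_path: str               # e.g. BVH / joint rotations
    scene: List[SceneObject] = field(default_factory=list)
    referent: str = ""             # object the utterance refers to

sample = SpatialGestureSample(
    transcript="The red cup is over there.",
    audio_path="clips/0001.wav",
    motion_path="motion/0001.bvh",
    scene=[SceneObject("red_cup", [1.2, 0.0, 0.9])],
    referent="red_cup",
)
```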

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Co-speech gesture, Deictic gestures, Gesture generation, Situated virtual agents, Synthetic data
HSV category
Identifiers
urn:nbn:se:kth:diva-359256 (URN), 10.1145/3652988.3673936 (DOI), 001441957400042 (), 2-s2.0-85215524347 (Scopus ID)
Conference
24th ACM International Conference on Intelligent Virtual Agents, IVA 2024, co-located with the Affective Computing and Intelligent Interaction 2024 Conference, ACII 2024, Glasgow, United Kingdom of Great Britain and Northern Ireland, September 16-19, 2024
Note

Part of ISBN 9798400706257

QC 20250203

Available from: 2025-01-29 Created: 2025-01-29 Last updated: 2025-04-30. Bibliographically checked
Organisations
Identifiers
ORCID iD: orcid.org/0000-0003-1399-6604