Gustafsson, Joakim, Professor (ORCID iD: orcid.org/0000-0002-0397-6442)
Publications (10 of 172)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings. Paper presented at the 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, October 23-26, 2024 (pp. 290-297). Springer Nature
2025 (English). In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, p. 290-297. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.
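The operator-injection mechanism described above lends itself to a simple pattern: annotations entered through the control interface are appended to the LLM's message history before the next response is generated. The sketch below illustrates that idea only; the message roles, the [operator] tag, and the inject_operator_note helper are assumptions for demonstration, not the paper's implementation.

```python
# Illustrative sketch (not the paper's code): merging human-operator
# annotations into the dialogue context of a GPT-4o-driven robot.
# Assumes the OpenAI Python client; message structure is hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dialogue = [
    {"role": "system", "content": "You are a social robot in a task-oriented dialogue."},
    {"role": "user", "content": "I'm not sure this piece fits here..."},
]

def inject_operator_note(history, note):
    """Append an operator observation (emotion/action/topic cue) as context."""
    history.append({"role": "system", "content": f"[operator] {note}"})

# The operator watches a live video feed and annotates what the models miss.
inject_operator_note(dialogue, "User looks frustrated; suggest a simpler step.")

reply = client.chat.completions.create(model="gpt-4o", messages=dialogue)
print(reply.choices[0].message.content)
```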

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
National Category
Natural Language Processing; Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-362497 (URN)
10.1007/978-981-96-3519-1_27 (DOI)
001531735400027 (ISI)
2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, October 23-26, 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-12-08. Bibliographically approved
Francis, J., Gustafsson, J. & Székely, É. (2025). From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS. In: Interspeech 2025. Paper presented at the 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 4960-4962). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 4960-4962. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an Augmentative and Alternative Communication (AAC) approach for minimally verbal children with Autism Spectrum Disorder. Whereas traditional AAC systems use fixed symbol sets and pre-defined Text-to-Speech (TTS) voices, the proposed method leverages text-to-image generation and zero-shot TTS to expand expressive capabilities. Users can create visual symbols for concepts and interests, enabling richer communication. Furthermore, zero-shot TTS allows users to upload or record personalized voices, giving each user individualized speech output. By minimizing reliance on static symbols and voices, this approach aims to increase communicative agency, personal relevance, and social validity, areas often neglected in traditional interventions. Future research will explore long-term effects on communicative skills, user satisfaction, social engagement, and adaptability across various cultural and linguistic settings, aiming to develop more dynamic and personalized AAC solutions.
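The paper does not name its TTS engine; as one concrete example of the zero-shot voice-cloning step it describes, the sketch below uses Coqui XTTS v2 as a stand-in, with placeholder file paths.

```python
# Illustrative stand-in for the zero-shot TTS step (the paper does not
# specify its engine): Coqui XTTS v2 clones a voice from a short
# reference recording. File paths below are placeholders.
from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# A parent or the child records a few seconds of reference audio once;
# every generated utterance is then spoken in that personalized voice.
tts.tts_to_file(
    text="I want to play with the red train.",
    speaker_wav="reference_voice.wav",  # placeholder recording
    language="en",
    file_path="utterance.wav",
)
```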

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
National Category
Natural Language Processing; Human Computer Interaction; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-372783 (URN)
10.21437/Interspeech.2025-2815 (DOI)
2-s2.0-105020070493 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251124

Available from: 2025-11-24. Created: 2025-11-24. Last updated: 2025-11-24. Bibliographically approved
Marcinek, L., Irfan, B., Skantze, G., Abelho Pereira, A. T. & Gustafsson, J. (2025). Role of Reasoning in LLM Enjoyment Detection: Evaluation Across Conversational Levels for Human-Robot Interaction. In: Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin (Eds.), Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Paper presented at the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Avignon, France, August 25-27, 2025 (pp. 573-590). SIGDIAL
2025 (English). In: Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue / [ed] Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin, SIGDIAL, 2025, p. 573-590. Conference paper, Published paper (Refereed)
Abstract [en]

User enjoyment is central to developing conversational AI systems that can recover from failures and maintain interest over time. However, existing approaches often struggle to detect subtle cues that reflect user experience. Large Language Models (LLMs) with reasoning capabilities have outperformed standard models on various other tasks, suggesting potential benefits for enjoyment detection. This study investigates whether models with reasoning capabilities outperform standard models when assessing enjoyment in a human-robot dialogue corpus at both turn and interaction levels. Results indicate that reasoning capabilities have complex, model-dependent effects rather than universal benefits. While performance was nearly identical at the interaction level (0.44 vs 0.43), reasoning models substantially outperformed at the turn level (0.42 vs 0.36). Notably, LLMs correlated better with users’ self-reported enjoyment metrics than human annotators, despite achieving lower accuracy against human consensus ratings. Analysis revealed distinctive error patterns: non-reasoning models showed bias toward positive ratings at the turn level, while both model types exhibited central tendency bias at the interaction level. These findings suggest that reasoning should be applied selectively based on model architecture and assessment context, with assessment granularity significantly influencing relative effectiveness.
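As an illustration of the turn-level assessment the study compares, the sketch below asks an LLM to rate a single exchange; the prompt wording, the 1-5 scale, and the model choice are assumptions for demonstration, not the paper's annotation protocol.

```python
# Illustrative sketch (not the paper's protocol): LLM-based enjoyment
# rating for one human-robot dialogue turn. Prompt text, rating scale,
# and model name are assumptions for demonstration only.
from openai import OpenAI

client = OpenAI()

def rate_turn_enjoyment(robot_utterance: str, user_response: str) -> str:
    prompt = (
        "Rate the user's enjoyment of this human-robot exchange on a "
        "1-5 scale (1 = not enjoying, 5 = clearly enjoying). "
        "Answer with the number only.\n"
        f"Robot: {robot_utterance}\nUser: {user_response}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content.strip()

print(rate_turn_enjoyment("How was your weekend?", "Oh, it was wonderful!"))
```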

Place, publisher, year, edition, pages
SIGDIAL, 2025
National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-374881 (URN)
Conference
26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Avignon, France, August 25-27, 2025
Note

Part of ISBN 979-8-89176-329-6

QC 20260107

Available from: 2026-01-06. Created: 2026-01-06. Last updated: 2026-01-07. Bibliographically approved
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. In: Interspeech 2025. Paper presented at the 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 2165-2169). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2165-2169. Conference paper, Published paper (Refereed)
Abstract [en]

We present an investigation into adaptable speech synthesis for noisy environments. Leveraging a zero-shot TTS system, we synthesized a corpus of 1,200 speech samples from 100 sentences of varying complexity, each generated at six distinct levels of vocal effort. To simulate realistic listening conditions, the synthesized speech was merged with environmental noise recordings from a diverse range of indoor and transportation settings at nine different signal-to-noise ratios (SNRs). We assessed the intelligibility of the resulting noisy speech using ASR word error rates (WER) across conditions. Additionally, the input text was evaluated using four metrics of sentence complexity and word predictability. Regression models that used noise type, SNR, vocal effort, and text features as input were trained to predict ASR WER. Results show that increased vocal effort improves intelligibility, with benefits of up to 30% in adverse conditions, most pronounced in environments with competing speech at low SNRs.
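The noise-mixing step described above reduces to scaling the noise so that the speech-to-noise power ratio hits a target value. A minimal sketch, assuming simple mean-power estimates and a synthetic test signal:

```python
# Minimal sketch of the noise-mixing step: scale a noise recording so
# that the speech/noise power ratio matches a target SNR in dB.
# Array names and the synthetic example signal are illustrative.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add `noise` to `speech` at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)  # loop/trim noise to match length
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_speech / (g**2 * p_noise)) = snr_db for the gain g.
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

# Example: a 1 s synthetic "speech" tone mixed with white noise at 0 dB SNR.
sr = 16000
t = np.arange(sr) / sr
speech = 0.1 * np.sin(2 * np.pi * 220 * t)
noisy = mix_at_snr(speech, np.random.randn(sr), snr_db=0.0)
```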

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
noisy environments, speech adaptation, speech intelligibility, speech synthesis
National Category
Natural Language Processing; Signal Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-372805 (URN)
10.21437/Interspeech.2025-2787 (DOI)
2-s2.0-105020064005 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13. Created: 2025-11-13. Last updated: 2025-11-13. Bibliographically approved
Lameris, H., Gustafsson, J. & Székely, É. (2025). VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. In: Interspeech 2025. Paper presented at the 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 2295-2299). International Speech Communication Association
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2295-2299. Conference paper, Published paper (Refereed)
Abstract [en]

Voice quality is an often overlooked aspect of speech with many communicative functions. It conveys both paralinguistic and pragmatic information, such as signalling speaker stance and aiding in grounding. In this paper, we present VoiceQualityVC, a tool that can manipulate the voice quality of both natural and synthesized speech using voice quality features including CPPS, H1-H2, and H1-A3. VoiceQualityVC is a research tool for perceptual experiments into voice quality and UX experiments for voice design. We perform an objective evaluation demonstrating the control of these features, as well as subjective listening tests of the paralinguistic attributes of intimacy, valence, and investment. In these listening tests, breathy voice was rated as more intimate and more invested than modal voice, and creaky voice was rated as less intimate and less positive. The code and models can be found at https://github.com/Hfkml/VQVC.
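Of the voice quality features named above, H1-H2 (the level difference between the first two harmonics) is the easiest to illustrate. The sketch below is a deliberately crude estimate, assuming librosa for f0 tracking and ignoring the formant corrections a real implementation would apply; it is not VoiceQualityVC's feature extractor.

```python
# Crude illustration of one voice-quality feature named above: H1-H2,
# the level difference (dB) between the first two harmonics. Frame
# selection and the bundled librosa example audio are assumptions;
# real implementations apply corrections this sketch omits.
import librosa
import numpy as np

def h1_h2(frame: np.ndarray, f0: float, sr: int) -> float:
    """Rough H1-H2 for one voiced frame: spectrum peaks near f0 and 2*f0."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / sr)

    def peak_db(target_hz):
        band = (freqs > 0.8 * target_hz) & (freqs < 1.2 * target_hz)
        return 20 * np.log10(spec[band].max() + 1e-12)

    return peak_db(f0) - peak_db(2 * f0)

# librosa.example downloads a short bundled speech clip on first use.
y, sr = librosa.load(librosa.example("libri1"), sr=16000)
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, frame_length=1024)

vi = np.flatnonzero(voiced)
i = vi[vi.size // 2]                      # a voiced frame mid-utterance
frame = y[i * 256 : i * 256 + 1024]       # hop defaults to frame_length // 4
print(f"H1-H2 ~ {h1_h2(frame, f0[i], sr):.1f} dB")
```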

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Paralinguistics, Pragmatics, Voice conversion, Voice quality
National Category
Comparative Language Studies and Linguistics; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372784 (URN)
10.21437/Interspeech.2025-902 (DOI)
2-s2.0-105020036268 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20. Created: 2025-11-20. Last updated: 2025-11-20. Bibliographically approved
Marcinek, L., Beskow, J. & Gustafsson, J. (2024). A dual-control dialogue framework for human-robot interaction data collection: integrating human emotional and contextual awareness with conversational AI. In: International Conference on Social Robotics (ICSR 2024). Paper presented at the International Conference on Social Robotics (ICSR 2024), Odense, Denmark, October 24-26, 2024.
2024 (English). In: International Conference on Social Robotics (ICSR 2024), 2024. Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions.

Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audiovisual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.

National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-375300 (URN)
Conference
International Conference on Social Robotics (ICSR 2024), Odense, Denmark, October 24-26, 2024
Note

QC 20260112

Available from: 2026-01-12. Created: 2026-01-12. Last updated: 2026-01-12. Bibliographically approved
Francis, J., Székely, É. & Gustafsson, J. (2024). ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1001-1002). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 1001-1002. Conference paper, Published paper (Refereed)
Abstract [en]

Recent developments in generative language modeling and conversational Text-to-Speech (TTS) present transformative potential for enhancing Augmentative and Alternative Communication (AAC) devices. Practical application of these technologies requires extensive research and testing. To address this, we introduce ConnecTone, a modular platform designed for rapid integration and testing of language generation and speech technology. ConnecTone implements context-sensitive generative text prediction, using conversational context from Automatic Speech Recognition inputs. The system incorporates a neural TTS that supports interpolation between reading and spontaneous conversational styles, along with adjustable prosodic features. These speech characteristics are predicted using Large Language Models, but can be adjusted by users to suit individual needs. We anticipate that ConnecTone will enable us to rapidly evaluate and implement innovations, delivering benefits to AAC users faster.
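Style interpolation of the kind described above is commonly realized as a convex combination of style embeddings. A conceptual sketch with placeholder vectors, not ConnecTone's actual model state:

```python
# Conceptual sketch of read<->spontaneous style interpolation: a convex
# combination of two style embeddings. The embeddings here are random
# placeholders, not ConnecTone's model state.
import numpy as np

rng = np.random.default_rng(0)
read_style = rng.normal(size=128)         # placeholder "read speech" embedding
spontaneous_style = rng.normal(size=128)  # placeholder "spontaneous" embedding

def interpolate_style(alpha: float) -> np.ndarray:
    """alpha=0 -> fully read style, alpha=1 -> fully spontaneous."""
    return (1.0 - alpha) * read_style + alpha * spontaneous_style

# A user (or an LLM prediction) picks a point on the style axis;
# the resulting vector would condition the TTS decoder.
style = interpolate_style(0.7)
```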

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
National Category
Natural Language Processing; Computer Sciences; Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-358873 (URN)
2-s2.0-85214814511 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically approved
Wang, S., Székely, É. & Gustafsson, J. (2024). Contextual Interactive Evaluation of TTS Models in Dialogue Systems. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2965-2969). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2965-2969. Conference paper, Published paper (Refereed)
Abstract [en]

Evaluation of text-to-speech (TTS) models is currently dominated by Mean Opinion Score (MOS) listening tests, but the validity of MOS has been increasingly questioned. MOS tests place listeners in a passive setup, in which they do not actively interact with the TTS and usually evaluate isolated utterances without context. They thus give no indication of how well a TTS model suits an interactive application such as a spoken dialogue system, in which the capability of generating appropriate speech in the dialogue context is paramount. We aim to take a first step towards addressing this shortcoming by evaluating several state-of-the-art neural TTS models, including one that adapts to dialogue context, in a custom-built spoken dialogue system. We present the system design, experiment setup, and results. Our work is the first to evaluate TTS in contextual dialogue system interactions. We also discuss the shortcomings and future opportunities of the proposed evaluation paradigm.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation methodology, human-computer interaction, spoken dialogue system, text-to-speech
National Category
Natural Language Processing; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-358876 (URN)
10.21437/Interspeech.2024-1008 (DOI)
001331850103017 (ISI)
2-s2.0-85214809755 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-05. Bibliographically approved
Lameris, H., Gustafsson, J. & Székely, É. (2024). CreakVC: A Voice Conversion Tool for Modulating Creaky Voice. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1005-1006). International Speech Communication Association
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 1005-1006. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce a human-in-the-loop one-shot voice conversion tool called CreakVC, designed to modulate the level of creaky voice in the converted speech. Creaky voice, often used by speakers to convey sociolinguistic cues, presents challenges to speech processing due to its complex phonation characteristics. The primary goal of CreakVC is to enable in-depth research into how these cues are perceived, using systematic perceptual studies. CreakVC provides access to a diverse range of voice identities exhibiting creaky voice, while maintaining consistency in other parameters. We developed a spectrogram-frame-level creak representation using CreaPy and fine-tuned FreeVC, a one-shot voice conversion tool, by conditioning the speaker embedding and the self-supervised audio representation with the creak representation. An integrated plotting feature allows users to visualize and manipulate portions of speech for precise adjustments of creaky phonation levels. Beyond research, CreakVC has potential applications in voice-interactive systems and multimedia production.
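The conditioning described above can be pictured as concatenating a per-frame creak level onto the model's frame-level features before decoding. The sketch below is a hypothetical illustration of that idea; the dimensions and the CreakConditioner module are invented, not CreakVC's actual architecture.

```python
# Hypothetical sketch of the conditioning idea: a frame-level creak
# feature concatenated onto frame-level content features before
# decoding. All dimensions and tensors are invented for illustration.
import torch
import torch.nn as nn

class CreakConditioner(nn.Module):
    def __init__(self, content_dim=256, creak_dim=1, out_dim=256):
        super().__init__()
        self.proj = nn.Linear(content_dim + creak_dim, out_dim)

    def forward(self, content, creak):
        # content: (batch, frames, content_dim) self-supervised features
        # creak:   (batch, frames, 1) per-frame creak level in [0, 1]
        return self.proj(torch.cat([content, creak], dim=-1))

cond = CreakConditioner()
content = torch.randn(1, 200, 256)   # placeholder SSL features
creak = torch.rand(1, 200, 1)        # user-edited creak curve
decoder_input = cond(content, creak) # -> (1, 200, 256)
```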

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
creaky voice, TTS, voice conversion
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-358875 (URN)
2-s2.0-85214828772 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically approved
Abelho Pereira, A. T., Marcinek, L., Miniotaitė, J., Thunberg, S., Lagerstedt, E., Gustafsson, J., ... Irfan, B. (2024). Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models. Paper presented at the 26th International Conference on Multimodal Interaction (ICMI), San Jose, USA, November 4-8, 2024 (pp. 469-478). Association for Computing Machinery (ACM)
2024 (English). Conference paper, Published paper (Refereed)
Abstract [en]

Enjoyment is a crucial yet complex indicator of positive user experience in Human-Robot Interaction (HRI). While manual enjoyment annotation is feasible, developing reliable automatic detection methods remains a challenge. This paper investigates a multimodal approach to automatic enjoyment annotation for HRI conversations, leveraging large language models (LLMs), visual, audio, and temporal cues. Our findings demonstrate that both text-only and multimodal LLMs with carefully designed prompts can achieve performance comparable to human annotators in detecting user enjoyment. Furthermore, results reveal a stronger alignment between LLM-based annotations and user self-reports of enjoyment compared to human annotators. While multimodal supervised learning techniques did not improve all of our performance metrics, they could successfully replicate human annotators and highlighted the importance of visual and audio cues in detecting subtle shifts in enjoyment. This research demonstrates the potential of LLMs for real-time enjoyment detection, paving the way for adaptive companion robots that can dynamically enhance user experiences.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Affect Recognition, Human-Robot Interaction, Large Language Models, Multimodal, Older Adults, User Enjoyment
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-359146 (URN)
10.1145/3678957.3685729 (DOI)
001433669800051 (ISI)
2-s2.0-85212589337 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction (ICMI), San Jose, USA, November 4-8, 2024
Note

QC 20250127

Available from: 2025-01-27. Created: 2025-01-27. Last updated: 2025-04-30. Bibliographically approved