Gustafsson, Joakim, Professor (ORCID: orcid.org/0000-0002-0397-6442)
Publications (10 of 172)
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI. In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings. Paper presented at 16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, October 23-26, 2024 (pp. 290-297). Springer Nature
A Dual-Control Dialogue Framework for Human-Robot Interaction Data Collection: Integrating Human Emotional and Contextual Awareness with Conversational AI
2025 (English). In: Social Robotics - 16th International Conference, ICSR + AI 2024, Proceedings, Springer Nature, 2025, pp. 290-297. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions. Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audio-visual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.

Place, publisher, year, edition, pages
Springer Nature, 2025
Keywords
Dialogue system, Emotions, Situational Context
HSV category
Identifiers
urn:nbn:se:kth:diva-362497 (URN), 10.1007/978-981-96-3519-1_27 (DOI), 001531735400027, 2-s2.0-105002141806 (Scopus ID)
Conference
16th International Conference on Social Robotics, ICSR + AI 2024, Odense, Denmark, October 23-26, 2024
Note

Part of ISBN 9789819635184

QC 20250424

Available from: 2025-04-16. Created: 2025-04-16. Last updated: 2025-12-08. Bibliographically checked.
Francis, J., Gustafsson, J. & Székely, É. (2025). From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 4960-4962). International Speech Communication Association
From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 4960-4962. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

This paper presents an Augmentative and Alternative Communication (AAC) approach for minimally verbal children with Autism Spectrum Disorder. Traditional AAC systems use fixed symbol sets and pre-defined Text-to-Speech (TTS) voices; the proposed method instead leverages text-to-image generation and zero-shot TTS to expand expressive capabilities. Users can create visual symbols for concepts and interests, enabling richer communication. Furthermore, zero-shot TTS allows users to upload or record personalized voices, giving each user individualized speech output. By minimizing reliance on static symbols and voices, this approach aims to increase communicative agency, personal relevance, and social validity, areas often neglected in traditional interventions. Future research will explore long-term effects on communicative skills, user satisfaction, social engagement, and adaptability across various cultural and linguistic settings, aiming to develop more dynamic and personalized AAC solutions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-372783 (URN), 10.21437/Interspeech.2025-2815 (DOI), 2-s2.0-105020070493 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251124

Available from: 2025-11-24. Created: 2025-11-24. Last updated: 2025-11-24. Bibliographically checked.
Marcinek, L., Irfan, B., Skantze, G., Abelho Pereira, A. T. & Gustafsson, J. (2025). Role of Reasoning in LLM Enjoyment Detection: Evaluation Across Conversational Levels for Human-Robot Interaction. In: Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin (Ed.), Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue. Paper presented at The 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Avignon, France, August 25-27, 2025 (pp. 573-590). SIGDIAL
Role of Reasoning in LLM Enjoyment Detection: Evaluation Across Conversational Levels for Human-Robot Interaction
2025 (English). In: Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue / [ed] Frédéric Béchet, Fabrice Lefèvre, Nicholas Asher, Seokhwan Kim, Teva Merlin, SIGDIAL, 2025, pp. 573-590. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

User enjoyment is central to developing conversational AI systems that can recover from failures and maintain interest over time. However, existing approaches often struggle to detect subtle cues that reflect user experience. Large Language Models (LLMs) with reasoning capabilities have outperformed standard models on various other tasks, suggesting potential benefits for enjoyment detection. This study investigates whether models with reasoning capabilities outperform standard models when assessing enjoyment in a human-robot dialogue corpus at both turn and interaction levels. Results indicate that reasoning capabilities have complex, model-dependent effects rather than universal benefits. While performance was nearly identical at the interaction level (0.44 vs 0.43), reasoning models substantially outperformed at the turn level (0.42 vs 0.36). Notably, LLMs correlated better with users’ self-reported enjoyment metrics than human annotators, despite achieving lower accuracy against human consensus ratings. Analysis revealed distinctive error patterns: non-reasoning models showed bias toward positive ratings at the turn level, while both model types exhibited central tendency bias at the interaction level. These findings suggest that reasoning should be applied selectively based on model architecture and assessment context, with assessment granularity significantly influencing relative effectiveness.

Place, publisher, year, edition, pages
SIGDIAL, 2025
HSV category
Identifiers
urn:nbn:se:kth:diva-374881 (URN)
Conference
The 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Avignon, France, August 25-27, 2025
Note

Part of ISBN 979-8-89176-329-6

QC 20260107

Available from: 2026-01-06. Created: 2026-01-06. Last updated: 2026-01-07. Bibliographically checked.
Marcinek, L., Beskow, J. & Gustafsson, J. (2025). Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2165-2169). International Speech Communication Association
Towards Adaptable and Intelligible Speech Synthesis in Noisy Environments
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2165-2169. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

We present an investigation into adaptable speech synthesis for noisy environments. Leveraging a zero-shot TTS system, we synthesized a corpus of 1,200 speech samples from 100 sentences of varying complexity, each generated at six distinct levels of vocal effort. To simulate realistic listening conditions, the synthesized speech is merged with environmental noise recordings from a diverse range of indoor and transportation settings at nine different signal-to-noise ratios. We assess the intelligibility of the resulting noisy speech using ASR word error rates across conditions. Additionally, the input text was evaluated using four metrics of sentence complexity and word predictability. A number of regression models that used noise type, SNR, vocal effort, and text features as input were trained to predict ASR WER. Results show that increased vocal effort improves intelligibility, with benefits of up to 30% in adverse conditions, most pronounced in environments with competing speech at low SNRs.
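For readers unfamiliar with the metric used in this abstract: word error rate (WER) is the word-level edit distance between an ASR transcript and the reference text, normalized by reference length. A minimal sketch (the function name and example strings are illustrative, not taken from the paper):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution (0 if words match)
        prev = curr
    return prev[len(hyp)] / len(ref)
```

A lower score means higher intelligibility of the noisy speech to the recognizer; a perfect transcript scores 0.0.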

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
noisy environments, speech adaptation, speech intelligibility, speech synthesis
HSV category
Identifiers
urn:nbn:se:kth:diva-372805 (URN), 10.21437/Interspeech.2025-2787 (DOI), 2-s2.0-105020064005 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13. Created: 2025-11-13. Last updated: 2025-11-13. Bibliographically checked.
Lameris, H., Gustafsson, J. & Székely, É. (2025). VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2295-2299). International Speech Communication Association
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2295-2299. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

Voice quality is an often overlooked aspect of speech with many communicative functions. It conveys both paralinguistic and pragmatic information, such as signalling speaker stance, and aids in grounding. In this paper, we present VoiceQualityVC, a tool that can manipulate the voice quality of both natural and synthesized speech using voice quality features including CPPS, H1-H2, and H1-A3. VoiceQualityVC is a research tool for perceptual experiments into voice quality and UX experiments for voice design. We perform an objective evaluation demonstrating the control of these features, as well as subjective listening tests of the paralinguistic attributes of intimacy, valence, and investment. In these listening tests, breathy voice was rated as more intimate and more invested than modal voice, and creaky voice was rated as less intimate and less positive. The code and models can be found at https://github.com/Hfkml/VQVC.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Paralinguistics, Pragmatics, Voice conversion, Voice quality
HSV category
Identifiers
urn:nbn:se:kth:diva-372784 (URN), 10.21437/Interspeech.2025-902 (DOI), 2-s2.0-105020036268 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20. Created: 2025-11-20. Last updated: 2025-11-20. Bibliographically checked.
Marcinek, L., Beskow, J. & Gustafsson, J. (2024). A dual-control dialogue framework for human-robot interaction data collection: integrating human emotional and contextual awareness with conversational AI. In: International Conference of Social Robotics (ICSR 2024). Paper presented at International Conference of Social Robotics (ICSR 2024), Odense, Denmark, October 24-26, 2024.
A dual-control dialogue framework for human-robot interaction data collection: integrating human emotional and contextual awareness with conversational AI
2024 (English). In: International Conference of Social Robotics (ICSR 2024), 2024. Conference paper, Poster (with or without abstract) (Peer-reviewed)
Abstract [en]

This paper presents a dialogue framework designed to capture human-robot interactions enriched with human-level situational awareness. The system integrates advanced large language models with real-time human-in-the-loop control. Central to this framework is an interaction manager that oversees information flow, turn-taking, and prosody control of a social robot’s responses. A key innovation is the control interface, enabling a human operator to perform tasks such as emotion recognition and action detection through a live video feed. The operator also manages high-level tasks, like topic shifts or behaviour instructions.

Input from the operator is incorporated into the dialogue context managed by GPT-4o, thereby influencing the ongoing interaction. This allows for the collection of interactional data from an automated system that leverages human-level emotional and situational awareness. The audiovisual data will be used to explore the impact of situational awareness on user behaviors in task-oriented human-robot interaction.

HSV category
Research programme
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-375300 (URN)
Conference
International Conference of Social Robotics (ICSR 2024), Odense, Denmark, October 24-26, 2024
Note

QC 20260112

Available from: 2026-01-12. Created: 2026-01-12. Last updated: 2026-01-12. Bibliographically checked.
Francis, J., Székely, É. & Gustafsson, J. (2024). ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1001-1002). International Speech Communication Association
ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 1001-1002. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

Recent developments in generative language modeling and conversational Text-to-Speech present transformative potential for enhancing Augmentative and Alternative Communication (AAC) devices. Practical application of these technologies requires extensive research and testing. To address this, we introduce ConnecTone, a modular platform designed for rapid integration and testing of language generation and speech technology. ConnecTone implements context-sensitive generative text prediction, using conversational context from Automatic Speech Recognition inputs. The system incorporates a neural TTS that supports interpolation between reading and spontaneous conversational styles, along with adjustable prosodic features. These speech characteristics are predicted using Large Language Models, but can be adjusted by users for individual needs. We anticipate ConnecTone will enable us to rapidly evaluate and implement innovations, thereby contributing to faster benefit delivery to AAC users.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-358873 (URN), 2-s2.0-85214814511 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically checked.
Wang, S., Székely, É. & Gustafsson, J. (2024). Contextual Interactive Evaluation of TTS Models in Dialogue Systems. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2965-2969). International Speech Communication Association
Contextual Interactive Evaluation of TTS Models in Dialogue Systems
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2965-2969. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

Evaluation of text-to-speech (TTS) models is currently dominated by Mean Opinion Score (MOS) listening tests, but the validity of MOS has been increasingly questioned. MOS tests place listeners in a passive setup, in which they do not actively interact with the TTS and usually evaluate isolated utterances without context. MOS thus gives no indication of how well a TTS model suits an interactive application such as a spoken dialogue system, in which the capability of generating appropriate speech in the dialogue context is paramount. We take a first step towards addressing this shortcoming by evaluating several state-of-the-art neural TTS models, including one that adapts to dialogue context, in a custom-built spoken dialogue system. We present the system design, experiment setup, and results. Our work is the first to evaluate TTS in contextual dialogue system interactions. We also discuss the shortcomings and future opportunities of the proposed evaluation paradigm.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation methodology, human-computer interaction, spoken dialogue system, text-to-speech
HSV category
Identifiers
urn:nbn:se:kth:diva-358876 (URN), 10.21437/Interspeech.2024-1008 (DOI), 001331850103017, 2-s2.0-85214809755 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-05. Bibliographically checked.
Lameris, H., Gustafsson, J. & Székely, É. (2024). CreakVC: A Voice Conversion Tool for Modulating Creaky Voice. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1005-1006). International Speech Communication Association
CreakVC: A Voice Conversion Tool for Modulating Creaky Voice
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 1005-1006. Conference paper, Published paper (Peer-reviewed)
Abstract [en]

We introduce a human-in-the-loop one-shot voice conversion tool called CreakVC designed to modulate the level of creaky voice in the converted speech. Creaky voice, often used by speakers to convey sociolinguistic cues, presents challenges to speech processing due to its complex phonation characteristics. The primary goal of CreakVC is to enable in-depth research into how these cues are perceived, using systematic perceptual studies. CreakVC provides access to a diverse range of voice identities exhibiting creaky voice, while maintaining consistency in other parameters. We developed a spectrogram-frame level creak representation using CreaPy and finetuned FreeVC, a one-shot voice conversion tool, by conditioning the speaker embedding and the self-supervised audio representation with the creak representation. An integrated plotting feature allows users to visualize and manipulate portions of speech for precise adjustments of creaky phonation levels. Beyond research, CreakVC has potential applications in voice-interactive systems and multimedia production.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
creaky voice, TTS, voice conversion
HSV category
Identifiers
urn:nbn:se:kth:diva-358875 (URN), 2-s2.0-85214828772 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically checked.
Abelho Pereira, A. T., Marcinek, L., Miniotaitė, J., Thunberg, S., Lagerstedt, E., Gustafsson, J., . . . Irfan, B. (2024). Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models. Paper presented at 26th International Conference on Multimodal Interaction (ICMI), San Jose, USA, November 4-8, 2024 (pp. 469-478). Association for Computing Machinery (ACM)
Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models
2024 (English). Conference paper, Published paper (Peer-reviewed)
Abstract [en]

Enjoyment is a crucial yet complex indicator of positive user experience in Human-Robot Interaction (HRI). While manual enjoyment annotation is feasible, developing reliable automatic detection methods remains a challenge. This paper investigates a multimodal approach to automatic enjoyment annotation for HRI conversations, leveraging large language models (LLMs), visual, audio, and temporal cues. Our findings demonstrate that both text-only and multimodal LLMs with carefully designed prompts can achieve performance comparable to human annotators in detecting user enjoyment. Furthermore, results reveal a stronger alignment between LLM-based annotations and user self-reports of enjoyment compared to human annotators. While multimodal supervised learning techniques did not improve all of our performance metrics, they could successfully replicate human annotators and highlighted the importance of visual and audio cues in detecting subtle shifts in enjoyment. This research demonstrates the potential of LLMs for real-time enjoyment detection, paving the way for adaptive companion robots that can dynamically enhance user experiences.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Affect Recognition, Human-Robot Interaction, Large Language Models, Multimodal, Older Adults, User Enjoyment
HSV category
Identifiers
urn:nbn:se:kth:diva-359146 (URN), 10.1145/3678957.3685729 (DOI), 001433669800051, 2-s2.0-85212589337 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction (ICMI), San Jose, USA, November 4-8, 2024
Note

QC 20250127

Available from: 2025-01-27. Created: 2025-01-27. Last updated: 2025-04-30. Bibliographically checked.