KTH Publications (DiVA)
Publications (10 of 74)
Bokkahalli Satish, S. H., Henter, G. E. & Székely, É. (2026). When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs. In: Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings. Paper presented at 27th International Conference on Speech and Computer, SPECOM 2025, Szeged, Hungary, October 13-15, 2025 (pp. 25-38). Springer Nature
When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs
2026 (English). In: Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings, Springer Nature, 2026, p. 25-38. Conference paper, Published paper (Refereed)
Abstract [en]

The rapid development of SpeechLLM-based conversational AI systems has created a need for robust benchmarking of these systems, including aspects of fairness and bias. At present, such benchmarks typically rely on multiple-choice question answering (MCQA). In this paper, we present the first token-level probabilistic evaluation and response-based study of several issues affecting the use of MCQA in SpeechLLM benchmarking: 1) we examine how model temperature and prompt design affect gender and positional bias on an MCQA gender-bias benchmark; 2) we examine how these biases are affected by the gender of the input voice; and 3) we study to what extent the observed trends carry over to a second gender-bias benchmark. Our results show that concerns about positional bias from the text domain are equally valid in the speech domain. We also find the effect to be stronger for female voices than for male voices. To our knowledge, this is the first study to isolate positional-bias effects in SpeechLLM-based gender-bias benchmarks. We conclude that current MCQA benchmarks do not account for speech-based bias, and that alternative strategies are needed to ensure fairness towards all users.

Place, publisher, year, edition, pages
Springer Nature, 2026
Keywords
Benchmark robustness, Positional bias, SpeechLLMs
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372782 (URN), 10.1007/978-3-032-07956-5_2 (DOI), 2-s2.0-105020237079 (Scopus ID)
Conference
27th International Conference on Speech and Computer, SPECOM 2025, Szeged, Hungary, October 13-15, 2025
Note

Part of ISBN 9783032079558

QC 20251120

Available from: 2025-11-20. Created: 2025-11-20. Last updated: 2025-11-20. Bibliographically approved.
Francis, J., Gustafsson, J. & Székely, É. (2025). From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 4960-4962). International Speech Communication Association
From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 4960-4962. Conference paper, Published paper (Refereed)
Abstract [en]

This paper presents an Augmentative and Alternative Communication (AAC) approach for minimally verbal children with Autism Spectrum Disorder. Whereas traditional AAC systems use fixed symbol sets and pre-defined Text-to-Speech (TTS) voices, the proposed method leverages text-to-image generation and zero-shot TTS to expand expressive capabilities. Users can create visual symbols for concepts and interests, enabling richer communication. Furthermore, zero-shot TTS allows users to upload or record personalized voices, giving them individualized speech output. By minimizing reliance on static symbols and voices, this approach aims to increase communicative agency, personal relevance, and social validity, areas often neglected in traditional interventions. Future research will explore long-term effects on communicative skills, user satisfaction, social engagement, and adaptability across various cultural and linguistic settings, aiming to develop more dynamic and personalized AAC solutions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
National Category
Natural Language Processing; Human Computer Interaction; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-372783 (URN), 10.21437/Interspeech.2025-2815 (DOI), 2-s2.0-105020070493 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251124

Available from: 2025-11-24. Created: 2025-11-24. Last updated: 2025-11-24. Bibliographically approved.
Bokkahalli Satish, S. H., Henter, G. E. & Székely, É. (2025). Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2151-2152). International Speech Communication Association
Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2151-2152. Conference paper, Published paper (Refereed)
Abstract [en]

A new wave of speech foundation models is emerging, capable of processing spoken language directly from audio. These models promise more expressive and emotionally aware interactions by retaining prosodic information throughout conversations. 'Hear Me Out' evaluates their ability to preserve crucial vocal cues, enabling users to explore how variations in speaker characteristics and paralinguistic features influence AI responses. Through real-time voice conversion, users can ask a question and then re-ask it with a modified voice, immediately observing differences in response tone, phrasing, and behavior. The system presents paired responses side by side, offering direct comparisons of AI interpretations of both the original and transformed voices, thereby highlighting potential biases. By inviting inquiry into speaker modeling, contextual understanding, and fairness, this immersive experience encourages users to reflect on identity and voice, and promotes inclusive future research.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
bias in conversational AI, speech-to-speech conversational AI, voice conversion
National Category
Natural Language Processing; Human Computer Interaction; Computer Sciences; Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-372786 (URN), 2-s2.0-105020052310 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20. Created: 2025-11-20. Last updated: 2025-11-20. Bibliographically approved.
Jacka, R., Peña, P. R., Leonard, S. J., Székely, É. & Cowan, B. R. (2025). Impact of Disfluent Speech Agent on Partner Models and Perspective Taking. In: CUI 2025 - Proceedings of the 2025 ACM Conference on Conversational User Interfaces. Paper presented at 7th Conference on Conversational User Interfaces, CUI 2025, Waterloo, Canada, July 8-10, 2025. Association for Computing Machinery (ACM), Article ID 14.
Impact of Disfluent Speech Agent on Partner Models and Perspective Taking
2025 (English). In: CUI 2025 - Proceedings of the 2025 ACM Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2025, article id 14. Conference paper, Published paper (Refereed)
Abstract [en]

Speech disfluencies play a role in perspective-taking and audience design in human-human communication (HHC), but little is known about their impact in human-machine dialogue (HMD). In an online Namer-Matcher task, sixty-one participants interacted with a speech agent using either fluent or disfluent speech. Participants completed a partner-modelling questionnaire (PMQ) both before and after the task. Post-interaction evaluations indicated that participants perceived the disfluent agent as more competent, despite no significant differences in pre-task ratings. However, no notable differences were observed in assessments of conversational flexibility or human-likeness. Our findings also reveal evidence of egocentric and allocentric language production when participants interact with speech agents. Interaction with disfluent speech agents appears to increase egocentric communication in comparison to fluent agents, although the wide credibility intervals mean this effect is not clear-cut. We discuss potential interpretations of this finding, focusing on how disfluencies may impact partner models and language production in HMD.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
Conversational Agents, Disfluency, Perspective-Taking
National Category
Comparative Language Studies and Linguistics; Human Computer Interaction; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-369073 (URN), 10.1145/3719160.3737625 (DOI), 001539402100012, 2-s2.0-105011598139 (Scopus ID)
Conference
7th Conference on Conversational User Interfaces, CUI 2025, Waterloo, Canada, July 8-10, 2025
Note

Part of ISBN 9798400715273

QC 20250922

Available from: 2025-09-22. Created: 2025-09-22. Last updated: 2025-09-22. Bibliographically approved.
Székely, É., Mihajlik, P., Kádár, M. S. & Tóth, L. (2025). Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2735-2739). International Speech Communication Association
Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2735-2739. Conference paper, Published paper (Refereed)
Abstract [en]

Personalised synthetic speech can enhance communication for Augmentative and Alternative Communication (AAC) users, but achieving high-quality, speaker-specific voices depends on various factors, such as the condition causing speech loss and the availability of recorded speech. Recent advancements in large-scale zero-shot TTS models may change the data requirements, as they have the potential to adapt to a wider range of inputs. This paper explores the potential of these pretrained models in various data availability scenarios, from extensive spontaneous speech to minimal or no unaffected speech. We evaluate a state-of-the-art TTS system on a case study involving a stroke survivor with dysarthria, leveraging both typical and atypical speech data. Additionally, we introduce a novel interactive approach using dysarthric speech as an audio prompt to enable user-guided prosody adaptation.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
assistive communication, augmentative and alternative communication, dysarthric speech, speech synthesis
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372804 (URN), 10.21437/Interspeech.2025-1726 (DOI), 2-s2.0-105020070750 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13. Created: 2025-11-13. Last updated: 2025-11-13. Bibliographically approved.
Lameris, H., Gustafsson, J. & Székely, É. (2025). VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2295-2299). International Speech Communication Association
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2295-2299. Conference paper, Published paper (Refereed)
Abstract [en]

Voice quality is an often overlooked aspect of speech with many communicative functions. It conveys both paralinguistic and pragmatic information, such as signalling speaker stance and aiding in grounding. In this paper, we present VoiceQualityVC, a tool that can manipulate the voice quality of both natural and synthesized speech using voice quality features including CPPS, H1-H2, and H1-A3. VoiceQualityVC is a research tool for perceptual experiments into voice quality and UX experiments for voice design. We perform an objective evaluation demonstrating the control of these features, as well as subjective listening tests of the paralinguistic attributes of intimacy, valence, and investment. In these listening tests, breathy voice was rated as more intimate and more invested than modal voice, while creaky voice was rated as less intimate and less positive. The code and models can be found at https://github.com/Hfkml/VQVC.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Paralinguistics, Pragmatics, Voice conversion, Voice quality
National Category
Comparative Language Studies and Linguistics; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372784 (URN), 10.21437/Interspeech.2025-902 (DOI), 2-s2.0-105020036268 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20. Created: 2025-11-20. Last updated: 2025-11-20. Bibliographically approved.
Hope, M. & Székely, É. (2025). Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 689-693). International Speech Communication Association
Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 689-693. Conference paper, Published paper (Refereed)
Abstract [en]

Speech-generating devices (SGDs) provide users with text-to-speech (TTS) voices that shape identity and self-expression. Current TTS voices enable self-expression but often lack customizable features for authentic voice embodiment, particularly for nonbinary SGD users seeking gender affirmation, as existing TTS voices largely reproduce binary, cisgender speech patterns. This study examines how nonbinary SGD users embody, or disembody, synthetic voices and the factors influencing voice affirmation. Through a survey, we analyze experiences of nonbinary SGD users and their impressions of generated speech samples, investigating the role of technological possibilities in gender affirmation and voice embodiment. Findings inform the creation of more user-centered TTS technologies, and challenge dominant paradigms in speech technology, gesturing toward a posthumanist rethinking of voice as co-constructed between human and machine.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
human-computer interaction, nonbinary, posthuman, text-to-speech, transgender, voice embodiment
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372793 (URN), 10.21437/Interspeech.2025-2229 (DOI), 2-s2.0-105020063225 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18. Created: 2025-11-18. Last updated: 2025-11-18. Bibliographically approved.
Puhach, D., Payberah, A. H. & Székely, É. (2025). Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2058-2062). International Speech Communication Association
Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, p. 2058-2062. Conference paper, Published paper (Refereed)
Abstract [en]

Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring words with gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and shows some gender inclinations.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
gender bias, speech synthesis, speech-LLM, TTS
National Category
Natural Language Processing; Gender Studies
Identifiers
urn:nbn:se:kth:diva-372806 (URN), 10.21437/Interspeech.2025-1402 (DOI), 2-s2.0-105020092426 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13. Created: 2025-11-13. Last updated: 2025-11-13. Bibliographically approved.
Székely, É. & Hope, M. (2024). An inclusive approach to creating a palette of synthetic voices for gender diversity. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 3070-3074). International Speech Communication Association
An inclusive approach to creating a palette of synthetic voices for gender diversity
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 3070-3074. Conference paper, Published paper (Refereed)
Abstract [en]

Mainstream text-to-speech (TTS) technologies predominantly rely on binary, cisgender speech, failing to adequately represent the diversity of gender expansive (e.g., transgender and/or nonbinary) people. This poses challenges, particularly for users of Speech Generating Devices (SGDs) seeking TTS voices that authentically reflect their identity and desired expressive nuances. This paper introduces a novel approach for constructing a palette of controllable gender-expansive TTS voices using recordings from 14 gender-expansive speakers. We employ Constrained PCA to extract gender-independent speaker identity vectors from x-vectors, using acoustic Vocal Tract Length (aVTL) as a known component. The result is applied as a speaker embedding in neural TTS, allowing control over the aVTL and several emergent properties captured as a representation of the vocal space across speakers. In addition to quantitative metrics, we present a community evaluation conducted by nonbinary SGD users.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
alternative communication, augmentative, diversity, gender and speech, gender expansive, inclusion, nonbinary, speech generating devices, TTS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358869 (URN), 10.21437/Interspeech.2024-1543 (DOI), 001331850103037, 2-s2.0-85214804000 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-05. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-1175-840X
