KTH Publications (kth.se)
Publications (10 of 74)
Bokkahalli Satish, S. H., Henter, G. E. & Székely, É. (2026). When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs. In: Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings. Paper presented at 27th International Conference on Speech and Computer, SPECOM 2025, Szeged, Hungary, October 13-15, 2025 (pp. 25-38). Springer Nature
When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs
2026 (English) In: Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings, Springer Nature, 2026, pp. 25-38. Conference paper, published (peer-reviewed)
Abstract [en]

The rapid development of SpeechLLM-based conversational AI systems has created a need for robustly benchmarking these efforts, including aspects of fairness and bias. At present, such benchmarks typically rely on multiple choice question answering (MCQA). In this paper, we present the first token-level probabilistic evaluation and response-based study of several issues affecting the use of MCQA in SpeechLLM benchmarking: 1) we examine how model temperature and prompt design affect gender and positional bias on an MCQA gender-bias benchmark; 2) we examine how these biases are affected by the gender of the input voice; and 3) we study to what extent observed trends carry over to a second gender-bias benchmark. Our results show that concerns about positional bias from the text domain are equally valid in the speech domain. We also find the effect to be stronger for female voices than for male voices. To our knowledge, this is the first study to isolate positional bias effects in SpeechLLM-based gender-bias benchmarks. We conclude that current MCQA benchmarks do not account for speech-based bias and alternative strategies are needed to ensure fairness towards all users.
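The positional-bias probe described in the abstract can be sketched at token level: score the answer-option letters, present the same two answers in both orders, and check how much probability mass stays on the first slot. This is an illustrative sketch, not the paper's implementation; `score_fn` stands in for whatever SpeechLLM scoring interface is available, and all names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def positional_preference(score_fn, question, options):
    """Average probability mass on the FIRST answer slot when the same
    two options are presented in both orders; values far from 0.5
    (averaged over many items) indicate positional rather than content
    preference."""
    p_first = []
    for opts in (options, options[::-1]):
        prompt = f"{question}\nA) {opts[0]}\nB) {opts[1]}\nAnswer:"
        # score_fn returns log-probabilities of the option letters
        logp_a, logp_b = score_fn(prompt, ("A", "B"))
        p_first.append(softmax([logp_a, logp_b])[0])
    return sum(p_first) / len(p_first)
```

Because the content of the two orders is identical, any deviation from 0.5 that survives averaging over both orders is attributable to position alone, which is the effect the paper isolates.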

Place, publisher, year, edition, pages
Springer Nature, 2026
Keywords
Benchmark robustness, Positional bias, SpeechLLMs
National subject category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372782 (URN)10.1007/978-3-032-07956-5_2 (DOI)2-s2.0-105020237079 (Scopus ID)
Conference
27th International Conference on Speech and Computer, SPECOM 2025, Szeged, Hungary, October 13-15, 2025
Note

Part of ISBN 9783032079558

QC 20251120

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-20. Bibliographically reviewed
Francis, J., Gustafsson, J. & Székely, É. (2025). From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 4960-4962). International Speech Communication Association
From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 4960-4962. Conference paper, published (peer-reviewed)
Abstract [en]

This paper presents an Augmentative and Alternative Communication (AAC) approach for minimally verbal children with Autism Spectrum Disorder. Traditional AAC systems use fixed symbol sets and pre-defined Text-to-Speech (TTS) voices; the proposed method instead leverages text-to-image generation and zero-shot TTS to expand expressive capabilities. Users can create visual symbols for concepts and interests, enabling richer communication. Further, zero-shot TTS allows users to upload or record personalized voices, giving each user individualized speech output. By minimizing reliance on static symbols and voices, this approach aims to increase communicative agency, personal relevance, and social validity, areas often neglected in traditional interventions. Future research will explore long-term effects on communicative skills, user satisfaction, social engagement, and adaptability across various cultural and linguistic settings, aiming to develop more dynamic and personalized AAC solutions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
National subject category
Natural Language Processing; Human-Computer Interaction (Interaction Design); Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-372783 (URN)10.21437/Interspeech.2025-2815 (DOI)2-s2.0-105020070493 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251124

Available from: 2025-11-24 Created: 2025-11-24 Last updated: 2025-11-24. Bibliographically reviewed
Bokkahalli Satish, S. H., Henter, G. E. & Székely, É. (2025). Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2151-2152). International Speech Communication Association
Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2151-2152. Conference paper, published (peer-reviewed)
Abstract [en]

A new wave of speech foundation models is emerging, capable of processing spoken language directly from audio. These models promise more expressive and emotionally aware interactions by retaining prosodic information throughout conversations. 'Hear Me Out' evaluates their ability to preserve crucial vocal cues, enabling users to explore how variations in speaker characteristics and paralinguistic features influence AI responses. Through real-time voice conversion, users can ask a question and then re-ask it in a modified voice, immediately observing differences in response tone, phrasing, and behavior. The system presents paired responses side by side, offering direct comparisons of AI interpretations of both the original and transformed voices, thereby highlighting potential biases. By inviting inquiry into speaker modeling, contextual understanding, and fairness, this immersive experience encourages users to reflect on identity and voice, and promotes inclusive future research.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
bias in conversational AI, speech-to-speech conversational AI, voice conversion
National subject category
Natural Language Processing; Human-Computer Interaction (Interaction Design); Computer Sciences; Comparative and General Linguistics
Identifiers
urn:nbn:se:kth:diva-372786 (URN)2-s2.0-105020052310 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-20. Bibliographically reviewed
Jacka, R., Peña, P. R., Leonard, S. J., Székely, É. & Cowan, B. R. (2025). Impact of Disfluent Speech Agent on Partner Models and Perspective Taking. In: CUI 2025 - Proceedings of the 2025 ACM Conference on Conversational User Interfaces. Paper presented at 7th Conference on Conversational User Interfaces, CUI 2025, Waterloo, Canada, July 8-10, 2025. Association for Computing Machinery (ACM), Article ID 14.
Impact of Disfluent Speech Agent on Partner Models and Perspective Taking
2025 (English) In: CUI 2025 - Proceedings of the 2025 ACM Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2025, article id 14. Conference paper, published (peer-reviewed)
Abstract [en]

Speech disfluencies play a role in perspective-taking and audience design in human-human communication (HHC), but little is known about their impact in human-machine dialogue (HMD). In an online Namer-Matcher task, sixty-one participants interacted with a speech agent using either fluent or disfluent speech. Participants completed a partner-modelling questionnaire (PMQ) both before and after the task. Post-interaction evaluations indicated that participants perceived the disfluent agent as more competent, despite no significant differences in pre-task ratings. However, no notable differences were observed in assessments of conversational flexibility or human-likeness. Our findings also reveal evidence of egocentric and allocentric language production when participants interact with speech agents. Interaction with disfluent speech agents appears to increase egocentric communication in comparison to fluent agents, although the wide credibility intervals mean this effect is not clear-cut. We discuss potential interpretations of this finding, focusing on how disfluencies may impact partner models and language production in HMD.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
Conversational Agents, Disfluency, Perspective-Taking
National subject category
Comparative and General Linguistics; Human-Computer Interaction (Interaction Design); Computer Sciences
Identifiers
urn:nbn:se:kth:diva-369073 (URN)10.1145/3719160.3737625 (DOI)001539402100012 ()2-s2.0-105011598139 (Scopus ID)
Conference
7th Conference on Conversational User Interfaces, CUI 2025, Waterloo, Canada, July 8-10, 2025
Note

Part of ISBN 9798400715273

QC 20250922

Available from: 2025-09-22 Created: 2025-09-22 Last updated: 2025-09-22. Bibliographically reviewed
Székely, É., Mihajlik, P., Kádár, M. S. & Tóth, L. (2025). Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2735-2739). International Speech Communication Association
Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2735-2739. Conference paper, published (peer-reviewed)
Abstract [en]

Personalised synthetic speech can enhance communication for Augmentative and Alternative Communication (AAC) users, but achieving high-quality, speaker-specific voices depends on various factors, such as the condition causing speech loss and the availability of recorded speech. Recent advancements in large-scale zero-shot TTS models may change the data requirements, as they have the potential to adapt to a wider range of inputs. This paper explores the potential of these pretrained models in various data availability scenarios, from extensive spontaneous speech to minimal or no unaffected speech. We evaluate a state-of-the-art TTS system on a case study involving a stroke survivor with dysarthria, leveraging both typical and atypical speech data. Additionally, we introduce a novel interactive approach using dysarthric speech as an audio prompt to enable user-guided prosody adaptation.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
assistive communication, augmentative and alternative communication, dysarthric speech, speech synthesis
National subject category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372804 (URN)10.21437/Interspeech.2025-1726 (DOI)2-s2.0-105020070750 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13. Bibliographically reviewed
Lameris, H., Gustafsson, J. & Székely, É. (2025). VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2295-2299). International Speech Communication Association
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2295-2299. Conference paper, published (peer-reviewed)
Abstract [en]

Voice quality is an often overlooked aspect of speech with many communicative functions. Voice quality conveys both paralinguistic and pragmatic information, for example signalling speaker stance, and aids in grounding. In this paper, we present VoiceQualityVC, a tool that can manipulate the voice quality of both natural and synthesized speech using voice quality features including CPPS, H1-H2, and H1-A3. VoiceQualityVC is a research tool for perceptual experiments into voice quality and UX experiments for voice design. We perform an objective evaluation demonstrating the control of these features as well as subjective listening tests of the paralinguistic attributes of intimacy, valence, and investment. In these listening tests, breathy voice was rated as more intimate and more invested than modal voice, and creaky voice was rated as less intimate and less positive. The code and models can be found at https://github.com/Hfkml/VQVC.
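One of the spectral measures named in the abstract, H1-H2, is the amplitude difference in dB between the first and second harmonics. A toy single-bin-DFT computation under the assumption that f0 is known and stable can be sketched as follows; this is not the VoiceQualityVC code, and the function names are hypothetical.

```python
import math

def harmonic_amplitude_db(signal, sr, freq):
    """Amplitude in dB of the sinusoidal component at `freq`,
    estimated by projecting onto a complex exponential (single-bin DFT)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq * i / sr) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * i / sr) for i, s in enumerate(signal))
    amp = 2.0 * math.hypot(re, im) / n  # rescale projection to component amplitude
    return 20.0 * math.log10(amp + 1e-12)

def h1_h2(signal, sr, f0):
    """H1-H2 in dB: first-harmonic amplitude minus second-harmonic amplitude."""
    return harmonic_amplitude_db(signal, sr, f0) - harmonic_amplitude_db(signal, sr, 2 * f0)
```

In practice such measures are computed frame by frame with pitch tracking and formant correction; the sketch only shows the defining amplitude difference.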

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Paralinguistics, Pragmatics, Voice conversion, Voice quality
National subject category
Comparative and General Linguistics; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372784 (URN)10.21437/Interspeech.2025-902 (DOI)2-s2.0-105020036268 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-20. Bibliographically reviewed
Hope, M. & Székely, É. (2025). Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 689-693). International Speech Communication Association
Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 689-693. Conference paper, published (peer-reviewed)
Abstract [en]

Speech-generating devices (SGDs) provide users with text-to-speech (TTS) voices that shape identity and self-expression. Current TTS voices enable self-expression but often lack customizable features for authentic voice embodiment, particularly for nonbinary SGD users seeking gender affirmation as existing TTS voices largely reproduce binary, cisgender speech patterns. This study examines how nonbinary SGD users embody, or disembody, synthetic voices and the factors influencing voice affirmation. Through a survey, we analyze experiences of nonbinary SGD users and their impressions of generated speech samples, investigating the role of technological possibilities in gender affirmation and voice embodiment. Findings inform the creation of more user-centered TTS technologies, and challenge dominant paradigms in speech technology, gesturing toward a posthumanist rethinking of voice as co-constructed between human and machine.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
human-computer interaction, nonbinary, posthuman, text-to-speech, transgender, voice embodiment
National subject category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-372793 (URN)10.21437/Interspeech.2025-2229 (DOI)2-s2.0-105020063225 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18 Created: 2025-11-18 Last updated: 2025-11-18. Bibliographically reviewed
Puhach, D., Payberah, A. H. & Székely, É. (2025). Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 2058-2062). International Speech Communication Association
Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2058-2062. Conference paper, published (peer-reviewed)
Abstract [en]

Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, featuring gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and some gender-related inclinations.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
gender bias, speech synthesis, speech-LLM, TTS
National subject category
Natural Language Processing; Gender Studies
Identifiers
urn:nbn:se:kth:diva-372806 (URN)10.21437/Interspeech.2025-1402 (DOI)2-s2.0-105020092426 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13. Bibliographically reviewed
Székely, É. & Hope, M. (2024). An inclusive approach to creating a palette of synthetic voices for gender diversity. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 3070-3074). International Speech Communication Association
An inclusive approach to creating a palette of synthetic voices for gender diversity
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, pp. 3070-3074. Conference paper, published (peer-reviewed)
Abstract [en]

Mainstream text-to-speech (TTS) technologies predominantly rely on binary, cisgender speech, failing to adequately represent the diversity of gender expansive (e.g., transgender and/or nonbinary) people. This poses challenges, particularly for users of Speech Generating Devices (SGDs) seeking TTS voices that authentically reflect their identity and desired expressive nuances. This paper introduces a novel approach for constructing a palette of controllable gender-expansive TTS voices using recordings from 14 gender-expansive speakers. We employ Constrained PCA to extract gender-independent speaker identity vectors from x-vectors, using acoustic Vocal Tract Length (aVTL) as a known component. The result is applied as a speaker embedding in neural TTS, allowing control over the aVTL and several emergent properties captured as a representation of the vocal space across speakers. In addition to quantitative metrics, we present a community evaluation conducted by nonbinary SGD users.
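The constrained-PCA idea in the abstract, with aVTL as the known component, can be sketched as: regress the known covariate out of the centered x-vectors, run PCA on the residual, and stack the covariate back on as the controllable first dimension. This is a rough illustration under simplifying assumptions (scalar aVTL per speaker, plain least squares), not the authors' implementation; all names are hypothetical.

```python
import numpy as np

def constrained_embedding(xvectors, avtl, n_components=2):
    """Split speaker x-vectors into a known aVTL dimension plus
    aVTL-independent residual directions (a sketch of the constrained-PCA
    idea: remove the known covariate, then PCA on what remains)."""
    X = np.asarray(xvectors, dtype=float)
    a = np.asarray(avtl, dtype=float)
    X = X - X.mean(axis=0)              # center x-vectors
    a = a - a.mean()                    # center known covariate
    slopes = (a @ X) / (a @ a)          # per-dimension least-squares slope on aVTL
    residual = X - np.outer(a, slopes)  # part of X orthogonal to aVTL
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    pcs = residual @ vt[:n_components].T  # PCA scores of the residual space
    return np.column_stack([a, pcs])    # [aVTL | emergent residual dims]
```

The first output column is directly interpretable (and controllable) as aVTL, while the remaining columns capture speaker variation that is, by construction, decorrelated from it.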

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
alternative communication, augmentative, diversity, gender and speech, gender expansive, inclusion, nonbinary, speech generating devices, TTS
National subject category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358869 (URN)10.21437/Interspeech.2024-1543 (DOI)001331850103037 ()2-s2.0-85214804000 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-05. Bibliographically reviewed
Identifiers
ORCID iD: orcid.org/0000-0003-1175-840X
