Publications (10 of 74)
Bokkahalli Satish, S. H., Henter, G. E. & Székely, É. (2026). When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs. In: Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings. Paper presented at 27th International Conference on Speech and Computer, SPECOM 2025, Szeged, Hungary, October 13-15, 2025 (pp. 25-38). Springer Nature.
When Voice Matters: Evidence of Gender Disparity in Positional Bias of SpeechLLMs
2026 (English). In: Speech and Computer - 27th International Conference, SPECOM 2025, Proceedings, Springer Nature, 2026, pp. 25-38. Conference paper, published paper (refereed)
Abstract [en]

The rapid development of SpeechLLM-based conversational AI systems has created a need for robustly benchmarking these efforts, including aspects of fairness and bias. At present, such benchmarks typically rely on multiple choice question answering (MCQA). In this paper, we present the first token-level probabilistic evaluation and response-based study of several issues affecting the use of MCQA in SpeechLLM benchmarking: 1) we examine how model temperature and prompt design affect gender and positional bias on an MCQA gender-bias benchmark; 2) we examine how these biases are affected by the gender of the input voice; and 3) we study to what extent observed trends carry over to a second gender-bias benchmark. Our results show that concerns about positional bias from the text domain are equally valid in the speech domain. We also find the effect to be stronger for female voices than for male voices. To our knowledge, this is the first study to isolate positional bias effects in SpeechLLM-based gender-bias benchmarks. We conclude that current MCQA benchmarks do not account for speech-based bias, and that alternative strategies are needed to ensure fairness towards all users.

Place, publisher, year, edition, pages
Springer Nature, 2026
Keywords
Benchmark robustness, Positional bias, SpeechLLMs
HSV category
Identifiers
urn:nbn:se:kth:diva-372782 (URN), 10.1007/978-3-032-07956-5_2 (DOI), 2-s2.0-105020237079 (Scopus ID)
Conference
27th International Conference on Speech and Computer, SPECOM 2025, Szeged, Hungary, October 13-15, 2025
Note

Part of ISBN 9783032079558

QC 20251120

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-20. Bibliographically checked.
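The token-level probing described in the abstract above can be illustrated with a minimal sketch. The paper does not publish its evaluation code, so score_options below is a hypothetical stand-in for querying a SpeechLLM for the probability of each answer label, and toy_model only simulates a biased scorer; the point is how presenting both orderings separates positional preference from content preference.

def first_slot_preference(question, opt_x, opt_y, score_options):
    # Present the same option pair in both orders. A position-neutral model
    # puts the same probability mass on each *content* across orderings, so
    # the average mass landing on slot A isolates pure positional bias.
    p_xy = score_options(question, [opt_x, opt_y])   # [P("A"), P("B")]
    p_yx = score_options(question, [opt_y, opt_x])
    return 0.5 * (p_xy[0] + p_yx[0])                 # > 0.5 means first-slot bias

def toy_model(question, options):
    # Stand-in scorer with a built-in 60/40 preference for whatever sits
    # in the first slot, regardless of content (illustration only).
    return [0.6, 0.4]

bias = first_slot_preference("Who is the engineer?", "the woman", "the man", toy_model)
print(f"mean probability mass on the first answer slot: {bias:.2f}")  # 0.60

Running the same probe once with a female and once with a male input voice, as the paper does, then turns the gap between the two bias scores into the gender disparity of interest.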
Francis, J., Gustafsson, J. & Székely, É. (2025). From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 4960-4962). International Speech Communication Association.
From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 4960-4962. Conference paper, published paper (refereed)
Abstract [en]

This paper presents an Augmentative and Alternative Communication (AAC) approach for minimally verbal children with Autism Spectrum Disorder. Where traditional AAC systems use fixed symbol sets and pre-defined Text-to-Speech (TTS) voices, the proposed method leverages text-to-image generation and zero-shot TTS to expand expressive capabilities. Users can create visual symbols for concepts and interests, enabling richer communication. Further, zero-shot TTS allows users to upload or record personalized voices, giving each user individualized output. By minimizing reliance on static symbols and voices, this approach aims to increase communicative agency, personal relevance, and social validity, areas often neglected in traditional interventions. Future research will explore long-term effects on communicative skills, user satisfaction, social engagement, and adaptability across various cultural and linguistic settings, aiming to develop more dynamic and personalized AAC solutions.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-372783 (URN), 10.21437/Interspeech.2025-2815 (DOI), 2-s2.0-105020070493 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251124

Available from: 2025-11-24 Created: 2025-11-24 Last updated: 2025-11-24. Bibliographically checked.
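The two-page abstract above sketches an architecture rather than a concrete stack. As one way to realize it, the sketch below pairs two openly available components, Stable Diffusion via Hugging Face diffusers for symbol generation and Coqui XTTS-v2 for zero-shot voice cloning; the model choices and file names are assumptions for illustration, not the paper's.

import torch
from diffusers import StableDiffusionPipeline
from TTS.api import TTS

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Personalised visual symbol: generate an icon for a concept or interest.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
pipe("a simple, friendly cartoon icon of a swimming pool").images[0].save("symbol_pool.png")

# 2) Personalised voice: XTTS-v2 clones the timbre of a short uploaded or
#    recorded reference without any fine-tuning (zero-shot).
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)
tts.tts_to_file(
    text="I want to go to the pool.",
    speaker_wav="reference_voice.wav",   # a recording chosen by the user or family
    language="en",
    file_path="message.wav",
)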
Bokkahalli Satish, S. H., Henter, G. E. & Székely, É. (2025). Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 2151-2152). International Speech Communication Association.
Hear Me Out: Interactive evaluation and bias discovery platform for speech-to-speech conversational AI
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2151-2152. Conference paper, published paper (refereed)
Abstract [en]

A new wave of speech foundation models is emerging, capable of processing spoken language directly from audio. These models promise more expressive and emotionally aware interactions by retaining prosodic information throughout conversations. 'Hear Me Out' evaluates their ability to preserve crucial vocal cues, enabling users to explore how variations in speaker characteristics and paralinguistic features influence AI responses. Through real-time voice conversion, users can ask a question and then re-ask it in a modified voice, immediately observing differences in response tone, phrasing, and behavior. The system presents paired responses side by side, offering direct comparisons of AI interpretations of the original and transformed voices, thereby highlighting potential biases. By inviting inquiry into speaker modeling, contextual understanding, and fairness, this immersive experience encourages users to reflect on identity and voice, and aims to promote inclusive future research.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
bias in conversational AI, speech-to-speech conversational AI, voice conversion
HSV category
Identifiers
urn:nbn:se:kth:diva-372786 (URN), 2-s2.0-105020052310 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-20. Bibliographically checked.
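Stripped of the interface, the demo's logic is a paired probe: hold the question constant, vary only the voice, and compare responses. In the skeleton below, convert_voice and respond are hypothetical placeholders, since the abstract does not name the voice-conversion or SpeechLLM back-ends actually used.

def convert_voice(wav, target_speaker):
    """Hypothetical placeholder: re-synthesise wav in target_speaker's voice."""
    ...

def respond(wav):
    """Hypothetical placeholder: the SpeechLLM's response to a spoken question."""
    ...

def paired_probe(question_wav, target_speaker):
    # Same words and prosody source, different speaker identity: any
    # difference between the two responses can be attributed to speaker
    # characteristics rather than to the question itself.
    return {
        "original voice": respond(question_wav),
        "converted voice": respond(convert_voice(question_wav, target_speaker)),
    }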
Jacka, R., Peña, P. R., Leonard, S. J., Székely, É. & Cowan, B. R. (2025). Impact Of Disfluent Speech Agent On Partner Models And Perspective Taking. In: CUI 2025 - Proceedings of the 2025 ACM Conference on Conversational User Interfaces. Paper presented at 7th Conference on Conversational User Interfaces, CUI 2025, Waterloo, Canada, Jul 8 2025 - Jul 10 2025. Association for Computing Machinery (ACM), Article ID 14.
Impact Of Disfluent Speech Agent On Partner Models And Perspective Taking
2025 (English). In: CUI 2025 - Proceedings of the 2025 ACM Conference on Conversational User Interfaces, Association for Computing Machinery (ACM), 2025, article id 14. Conference paper, published paper (refereed)
Abstract [en]

Speech disfluencies play a role in perspective-taking and audience design in human-human communication (HHC), but little is known about their impact in human-machine dialogue (HMD). In an online Namer-Matcher task, sixty-one participants interacted with a speech agent using either fluent or disfluent speech. Participants completed a partner-modelling questionnaire (PMQ) both before and after the task. Post-interaction evaluations indicated that participants perceived the disfluent agent as more competent, despite no significant differences in pre-task ratings. However, no notable differences were observed in assessments of conversational flexibility or human-likeness. Our findings also reveal evidence of egocentric and allocentric language production when participants interact with speech agents. Interaction with disfluent speech agents appears to increase egocentric communication in comparison to fluent agents, although the wide credibility intervals mean this effect is not clear-cut. We discuss potential interpretations of this finding, focusing on how disfluencies may impact partner models and language production in HMD.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
Conversational Agents, Disfluency, Perspective-Taking
HSV category
Identifiers
urn:nbn:se:kth:diva-369073 (URN), 10.1145/3719160.3737625 (DOI), 001539402100012 (ISI), 2-s2.0-105011598139 (Scopus ID)
Conference
7th Conference on Conversational User Interfaces, CUI 2025, Waterloo, Canada, Jul 8 2025 - Jul 10 2025
Note

Part of ISBN 9798400715273

QC 20250922

Available from: 2025-09-22 Created: 2025-09-22 Last updated: 2025-09-22. Bibliographically checked.
Székely, É., Mihajlik, P., Kádár, M. S. & Tóth, L. (2025). Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 2735-2739). International Speech Communication Association.
Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2735-2739. Conference paper, published paper (refereed)
Abstract [en]

Personalised synthetic speech can enhance communication for Augmentative and Alternative Communication (AAC) users, but achieving high-quality, speaker-specific voices depends on various factors, such as the condition causing speech loss and the availability of recorded speech. Recent advancements in large-scale zero-shot TTS models may change the data requirements, as they have the potential to adapt to a wider range of inputs. This paper explores the potential of these pretrained models in various data availability scenarios, from extensive spontaneous speech to minimal or no unaffected speech. We evaluate a state-of-the-art TTS system on a case study involving a stroke survivor with dysarthria, leveraging both typical and atypical speech data. Additionally, we introduce a novel interactive approach using dysarthric speech as an audio prompt to enable user-guided prosody adaptation.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
assistive communication, augmentative and alternative communication, dysarthric speech, speech synthesis
HSV category
Identifiers
urn:nbn:se:kth:diva-372804 (URN), 10.21437/Interspeech.2025-1726 (DOI), 2-s2.0-105020070750 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13. Bibliographically checked.
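The zero-shot arm of the case study above amounts to prompting a large pretrained TTS model with whatever reference audio exists. A minimal sketch, assuming Coqui XTTS-v2 as the system (the abstract does not name the state-of-the-art model evaluated) and with placeholder wav paths for the archival typical speech and the novel dysarthric audio prompt:

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# Reconstruct the same utterance under two data-availability scenarios.
for label, prompt_wav in [
    ("typical", "pre_stroke_recording.wav"),      # archival unaffected speech
    ("atypical", "dysarthric_audio_prompt.wav"),  # user-guided prosody prompt
]:
    tts.tts_to_file(
        text="See you at dinner tonight.",
        speaker_wav=prompt_wav,
        language="en",
        file_path=f"reconstructed_{label}.wav",
    )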
Lameris, H., Gustafsson, J. & Székely, É. (2025). VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 2295-2299). International Speech Communication Association.
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2295-2299. Conference paper, published paper (refereed)
Abstract [en]

Voice quality is an often overlooked aspect of speech with many communicative functions. Voice quality conveys both paralinguistic and pragmatic information, such as signalling speaker stance and aiding in grounding. In this paper, we present VoiceQualityVC, a tool that can manipulate the voice quality of both natural and synthesized speech using voice quality features including CPPS, H1-H2, and H1-A3. VoiceQualityVC is a research tool for perceptual experiments into voice quality and UX experiments for voice design. We perform an objective evaluation demonstrating the control of these features, as well as subjective listening tests of the paralinguistic attributes of intimacy, valence, and investment. In these listening tests, breathy voice was rated as more intimate and more invested than modal voice, while creaky voice was rated as less intimate and less positive. The code and models can be found at https://github.com/Hfkml/VQVC.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Paralinguistics, Pragmatics, Voice conversion, Voice quality
HSV category
Identifiers
urn:nbn:se:kth:diva-372784 (URN), 10.21437/Interspeech.2025-902 (DOI), 2-s2.0-105020036268 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-20. Bibliographically checked.
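The control features named in the abstract above are standard acoustic voice-quality measures: CPPS is the smoothed cepstral peak prominence, while H1-H2 and H1-A3 compare the level of the first harmonic to the second harmonic and to the strongest harmonic near the third formant. A rough, uncorrected H1-H2 estimate for a short sustained vowel might look like the sketch below; the authors' actual feature extraction is in the linked repository, and vowel.wav is a placeholder.

import numpy as np
import librosa

y, sr = librosa.load("vowel.wav", sr=16000)         # short sustained vowel
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
f0_hz = np.nanmedian(f0[voiced])                    # median pitch over voiced frames

spec = np.abs(np.fft.rfft(y * np.hanning(len(y))))  # single long-window spectrum
freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)

def harmonic_db(k):
    # Level (dB) of the k-th harmonic: strongest bin within +/-10% of k*f0.
    band = (freqs > 0.9 * k * f0_hz) & (freqs < 1.1 * k * f0_hz)
    return 20 * np.log10(spec[band].max())

# Breathy phonation raises H1-H2 and creak lowers it, which is the axis the
# paper's intimacy/valence listening tests probe.
print("H1-H2 (dB):", harmonic_db(1) - harmonic_db(2))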
Hope, M. & Székely, É. (2025). Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 689-693). International Speech Communication Association.
Voices of 'cyborg awesomeness': Posthuman embodiment of nonbinary gender expression in AI speech technologies
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 689-693. Conference paper, published paper (refereed)
Abstract [en]

Speech-generating devices (SGDs) provide users with text-to-speech (TTS) voices that shape identity and self-expression. Current TTS voices enable self-expression but often lack customizable features for authentic voice embodiment, particularly for nonbinary SGD users seeking gender affirmation, since existing voices largely reproduce binary, cisgender speech patterns. This study examines how nonbinary SGD users embody, or disembody, synthetic voices and the factors influencing voice affirmation. Through a survey, we analyze the experiences of nonbinary SGD users and their impressions of generated speech samples, investigating the role of technological possibilities in gender affirmation and voice embodiment. The findings inform the creation of more user-centered TTS technologies and challenge dominant paradigms in speech technology, gesturing toward a posthumanist rethinking of voice as co-constructed between human and machine.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
human-computer interaction, nonbinary, posthuman, text-to-speech, transgender, voice embodiment
HSV category
Identifiers
urn:nbn:se:kth:diva-372793 (URN), 10.21437/Interspeech.2025-2229 (DOI), 2-s2.0-105020063225 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251118

Available from: 2025-11-18 Created: 2025-11-18 Last updated: 2025-11-18. Bibliographically checked.
Puhach, D., Payberah, A. H. & Székely, É. (2025). Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025 (pp. 2058-2062). International Speech Communication Association.
Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2058-2062. Conference paper, published paper (refereed)
Abstract [en]

Similar to text-based Large Language Models (LLMs), Speech-LLMs exhibit emergent abilities and context awareness. However, whether these similarities extend to gender bias remains an open question. This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. Unlike text-based models, which encode gendered associations implicitly, Speech-LLMs must produce a gendered voice, making speaker selection an explicit bias cue. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design. To test this, we construct two datasets: (i) Professions, containing gender-stereotyped occupations, and (ii) Gender-Colored Words, containing words with gendered connotations. While Bark does not exhibit systematic bias, it demonstrates gender awareness and shows some gender-related inclinations.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
gender bias, speech synthesis, speech-LLM, TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-372806 (URN), 10.21437/Interspeech.2025-1402 (DOI), 2-s2.0-105020092426 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, the Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13. Bibliographically checked.
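The probing setup above can be sketched as follows: call Bark without a history_prompt so the model assigns a voice on its own, then judge the perceived gender of the output. Bark's generate_audio API is real, but the crude mean-f0 threshold below is only a stand-in for whatever judgment procedure the authors used (the abstract does not specify it), and the two sentences merely exemplify the Professions dataset.

import numpy as np
import librosa
from bark import SAMPLE_RATE, generate_audio

def crude_gender_guess(audio, sr=SAMPLE_RATE):
    # Very rough heuristic: mean f0 above ~165 Hz reads as female-sounding.
    f0, voiced, _ = librosa.pyin(audio.astype(np.float32), fmin=60, fmax=400, sr=sr)
    return "female-sounding" if np.nanmean(f0[voiced]) > 165 else "male-sounding"

for sentence in ["The nurse prepared the medication.",
                 "The mechanic repaired the engine."]:
    audio = generate_audio(sentence)  # no history_prompt: Bark picks the speaker
    print(sentence, "->", crude_gender_guess(audio))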
Székely, É. & Hope, M. (2024). An inclusive approach to creating a palette of synthetic voices for gender diversity. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 3070-3074). International Speech Communication Association.
An inclusive approach to creating a palette of synthetic voices for gender diversity
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 3070-3074. Conference paper, published paper (refereed)
Abstract [en]

Mainstream text-to-speech (TTS) technologies predominantly rely on binary, cisgender speech, failing to adequately represent the diversity of gender expansive (e.g., transgender and/or nonbinary) people. This poses challenges, particularly for users of Speech Generating Devices (SGDs) seeking TTS voices that authentically reflect their identity and desired expressive nuances. This paper introduces a novel approach for constructing a palette of controllable gender-expansive TTS voices using recordings from 14 gender-expansive speakers. We employ Constrained PCA to extract gender-independent speaker identity vectors from x-vectors, using acoustic Vocal Tract Length (aVTL) as a known component. The result is applied as a speaker embedding in neural TTS, allowing control over the aVTL and several emergent properties captured as a representation of the vocal space across speakers. In addition to quantitative metrics, we present a community evaluation conducted by nonbinary SGD users.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
alternative communication, augmentative, diversity, gender and speech, gender expansive, inclusion, nonbinary, speech generating devices, TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-358869 (URN), 10.21437/Interspeech.2024-1543 (DOI), 001331850103037 (ISI), 2-s2.0-85214804000 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-05. Bibliographically checked.
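One plausible reading of the Constrained PCA step in the abstract above: estimate the direction of x-vector variation explained by the known aVTL covariate, fix that as the first axis, and run ordinary PCA on the residual so the remaining identity axes are linearly independent of aVTL. The numpy sketch below follows that interpretation; it is not the authors' exact method, and X and avtl are random placeholders for real extracted features.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(14, 192))   # x-vectors for 14 speakers (placeholder)
avtl = rng.normal(size=14)       # acoustic vocal tract length per speaker

Xc = X - X.mean(axis=0)
vc = (avtl - avtl.mean()) / avtl.std()

# Known component: least-squares direction of x-vector variation along aVTL.
beta = Xc.T @ vc / (vc @ vc)

# Strip the aVTL component, then PCA (via SVD) the residual to obtain the
# emergent, aVTL-independent identity axes.
residual = Xc - np.outer(vc, beta)
_, s, Vt = np.linalg.svd(residual, full_matrices=False)
identity_axes = Vt[:3]

# A controllable speaker embedding: a chosen aVTL setting along the known
# axis plus coordinates along the emergent identity axes.
embedding = 0.5 * beta + identity_axes.T @ np.array([1.0, -0.3, 0.2])
print(embedding.shape)           # (192,), usable as a neural-TTS speaker embedding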
Identifiers
ORCID iD: orcid.org/0000-0003-1175-840X