2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, p. 2151-2152. Conference paper, Published paper (Refereed)
Abstract [en]
A new wave of speech foundation models is emerging, capable of processing spoken language directly from audio. These models promise more expressive and emotionally aware interactions by retaining prosodic information throughout conversations. 'Hear Me Out' evaluates their ability to preserve crucial vocal cues, enabling users to explore how variations in speaker characteristics and paralinguistic features influence AI responses. Through real-time voice conversion, users can ask a question and then re-ask it with a modified voice, immediately observing differences in response tone, phrasing, and behavior. The system presents paired responses side by side, offering direct comparisons of AI interpretations of both the original and transformed voices, thereby highlighting potential biases. By inviting inquiry into speaker modeling, contextual understanding, and fairness, this immersive experience encourages users to reflect on identity and voice, and promotes inclusive future research.
Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
bias in conversational AI, speech-to-speech conversational AI, voice conversion
National Category
Natural Language Processing; Human-Computer Interaction; Computer Sciences; Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-372786 (URN)
2-s2.0-105020052310 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Kingdom of the Netherlands, August 17-21, 2025
Note
QC 20251120
2025-11-20 Bibliographically approved