KTH Publications (DiVA)
Publications (10 of 63)
Székely, É. & Hope, M. (2024). An inclusive approach to creating a palette of synthetic voices for gender diversity. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 3070-3074). International Speech Communication Association
An inclusive approach to creating a palette of synthetic voices for gender diversity
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 3070-3074. Conference paper, Published paper (Refereed)
Abstract [en]

Mainstream text-to-speech (TTS) technologies predominantly rely on binary, cisgender speech, failing to adequately represent the diversity of gender expansive (e.g., transgender and/or nonbinary) people. This poses challenges, particularly for users of Speech Generating Devices (SGDs) seeking TTS voices that authentically reflect their identity and desired expressive nuances. This paper introduces a novel approach for constructing a palette of controllable gender-expansive TTS voices using recordings from 14 gender-expansive speakers. We employ Constrained PCA to extract gender-independent speaker identity vectors from x-vectors, using acoustic Vocal Tract Length (aVTL) as a known component. The result is applied as a speaker embedding in neural TTS, allowing control over the aVTL and several emergent properties captured as a representation of the vocal space across speakers. In addition to quantitative metrics, we present a community evaluation conducted by nonbinary SGD users.
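The abstract does not spell out the extraction step, but one plausible reading, sketched below under stated assumptions, is: regress the known aVTL value out of each speaker's x-vector, run ordinary PCA on the residual, and treat the aVTL scalar plus the residual components as a controllable speaker embedding. The function names, dimensions, and least-squares formulation are illustrative, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): one plausible reading of
# "Constrained PCA with aVTL as a known component".
import numpy as np
from sklearn.decomposition import PCA

def build_speaker_palette(x_vectors, avtl, n_components=4):
    """x_vectors: (n_speakers, d) matrix; avtl: (n_speakers,) acoustic vocal tract length."""
    mean = x_vectors.mean(axis=0)
    X = x_vectors - mean
    a = (avtl - avtl.mean()) / avtl.std()
    # Least-squares direction of aVTL in x-vector space, then project it out.
    w = X.T @ a / (a @ a)                     # (d,) aVTL-correlated direction
    residual = X - np.outer(a, w)             # x-vectors with the aVTL component removed
    pca = PCA(n_components=n_components).fit(residual)
    identity_codes = pca.transform(residual)  # aVTL-independent identity coordinates
    return mean, w, pca, np.column_stack([a, identity_codes])

def decode_embedding(avtl_ctrl, identity_codes, mean, w, pca):
    """Recombine a chosen aVTL level with identity codes into a TTS speaker embedding."""
    return mean + avtl_ctrl * w + pca.inverse_transform(identity_codes[None])[0]
```

In this reading, varying `avtl_ctrl` while holding the identity codes fixed is what gives the voice palette its controllable aVTL axis.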

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
alternative communication, augmentative, diversity, gender and speech, gender expansive, inclusion, nonbinary, speech generating devices, TTS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358869 (URN)10.21437/Interspeech.2024-1543 (DOI)2-s2.0-85214804000 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-28. Bibliographically approved
Francis, J., Székely, É. & Gustafsson, J. (2024). ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1001-1002). International Speech Communication Association
ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 1001-1002. Conference paper, Published paper (Refereed)
Abstract [en]

Recent developments in generative language modeling and conversational text-to-speech present transformative potential for enhancing Augmentative and Alternative Communication (AAC) devices. Practical application of these technologies requires extensive research and testing. To address this, we introduce ConnecTone, a modular platform designed for rapid integration and testing of language generation and speech technology. ConnecTone implements context-sensitive generative text prediction, using conversational context from Automatic Speech Recognition inputs. The system incorporates a neural TTS that supports interpolation between reading and spontaneous conversational styles, along with adjustable prosodic features. These speech characteristics are predicted using Large Language Models, but can be adjusted by users to suit individual needs. We anticipate that ConnecTone will enable us to rapidly evaluate and implement innovations, delivering benefits to AAC users faster.
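As a rough illustration of the contextual text-prediction component described above, the sketch below builds an LLM prompt from the partner's ASR transcript and the user's partial input, and asks for a handful of candidate replies. `llm_complete` is a stand-in for whatever model backend ConnecTone actually uses; the prompt wording and candidate count are assumptions, not taken from the paper.

```python
# Minimal sketch of contextual text prediction for an AAC interface.
# `llm_complete` is a hypothetical backend call, not part of the paper.
def predict_candidates(partner_utterance: str, user_prefix: str,
                       llm_complete, n_candidates: int = 3) -> list[str]:
    prompt = (
        "You assist a person who uses an AAC device.\n"
        f'Their conversation partner just said: "{partner_utterance}"\n'
        f'The user has typed so far: "{user_prefix}"\n'
        f"Suggest {n_candidates} short, natural replies the user might intend, "
        "one per line, continuing from what they typed."
    )
    raw = llm_complete(prompt)                 # assumed to return plain text
    lines = [l.strip("-• ").strip() for l in raw.splitlines() if l.strip()]
    return lines[:n_candidates]
```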

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
National Category
Natural Language Processing; Computer Sciences; Human-Computer Interaction
Identifiers
urn:nbn:se:kth:diva-358873 (URN)2-s2.0-85214814511 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically approved
Wang, S., Székely, É. & Gustafsson, J. (2024). Contextual Interactive Evaluation of TTS Models in Dialogue Systems. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2965-2969). International Speech Communication Association
Contextual Interactive Evaluation of TTS Models in Dialogue Systems
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2965-2969. Conference paper, Published paper (Refereed)
Abstract [en]

Evaluation of text-to-speech (TTS) models is currently dominated by Mean Opinion Score (MOS) listening tests, but the validity of MOS has been increasingly questioned. MOS tests place listeners in a passive setup, in which they do not actively interact with the TTS and usually evaluate isolated utterances without context. Such tests therefore give no indication of how well a TTS model suits an interactive application such as a spoken dialogue system, in which the capability of generating appropriate speech in the dialogue context is paramount. We aim to take a first step towards addressing this shortcoming by evaluating several state-of-the-art neural TTS models, including one that adapts to dialogue context, in a custom-built spoken dialogue system. We present the system design, experiment setup, and results. Our work is the first to evaluate TTS in contextual dialogue system interactions. We also discuss the shortcomings and future opportunities of the proposed evaluation paradigm.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation methodology, human-computer interaction, spoken dialogue system, text-to-speech
National Category
Natural Language Processing; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-358876 (URN)10.21437/Interspeech.2024-1008 (DOI)2-s2.0-85214809755 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-02-13. Bibliographically approved
Lameris, H., Gustafsson, J. & Székely, É. (2024). CreakVC: A Voice Conversion Tool for Modulating Creaky Voice. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1005-1006). International Speech Communication Association
CreakVC: A Voice Conversion Tool for Modulating Creaky Voice
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 1005-1006. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce CreakVC, a human-in-the-loop one-shot voice conversion tool designed to modulate the level of creaky voice in the converted speech. Creaky voice, often used by speakers to convey sociolinguistic cues, presents challenges to speech processing due to its complex phonation characteristics. The primary goal of CreakVC is to enable in-depth research into how these cues are perceived, using systematic perceptual studies. CreakVC provides access to a diverse range of voice identities exhibiting creaky voice, while maintaining consistency in other parameters. We developed a spectrogram-frame-level creak representation using CreaPy and fine-tuned FreeVC, a one-shot voice conversion tool, by conditioning the speaker embedding and the self-supervised audio representation with the creak representation. An integrated plotting feature allows users to visualize and manipulate portions of speech for precise adjustment of creaky phonation levels. Beyond research, CreakVC has potential applications in voice-interactive systems and multimedia production.
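A minimal sketch, under stated assumptions, of how a per-frame creak level might condition a voice-conversion model as described above: the creak probability is projected into the self-supervised feature space and into the speaker-embedding space, with a user-controlled scale applied at inference. The module layout and dimensions are illustrative and are not the CreakVC/FreeVC internals.

```python
# Illustrative sketch (not the CreakVC/FreeVC code): frame-level creak conditioning.
import torch
import torch.nn as nn

class CreakConditioner(nn.Module):
    def __init__(self, ssl_dim=1024, spk_dim=256):
        super().__init__()
        self.creak_to_frame = nn.Linear(1, ssl_dim)  # creak prob -> SSL feature space
        self.creak_to_spk = nn.Linear(1, spk_dim)    # global creak level -> speaker space

    def forward(self, ssl_feats, spk_emb, creak, creak_scale=1.0):
        """ssl_feats: (B, T, ssl_dim); spk_emb: (B, spk_dim); creak: (B, T) in [0, 1]."""
        creak = (creak * creak_scale).clamp(0.0, 1.0).unsqueeze(-1)  # user-controlled level
        cond_feats = ssl_feats + self.creak_to_frame(creak)          # frame-level conditioning
        cond_spk = spk_emb + self.creak_to_spk(creak.mean(dim=1))    # utterance-level conditioning
        return cond_feats, cond_spk
```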

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
creaky voice, TTS, voice conversion
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-358875 (URN)2-s2.0-85214828772 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically approved
Wang, S. & Székely, É. (2024). Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 6464-6474). European Language Resources Association (ELRA)
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 6464-6474. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advances in generative language modeling applied to discrete speech tokens have presented a new avenue for text-to-speech (TTS) synthesis. These speech language models (SLMs), similarly to their textual counterparts, are scalable, probabilistic, and context-aware. While they can produce diverse and natural outputs, they sometimes face issues such as unintelligibility and the inclusion of non-speech noises or hallucinations. As the adoption of this innovative paradigm in speech synthesis increases, there is a clear need for an in-depth evaluation of its capabilities and limitations. In this paper, we evaluate TTS from a discrete token-based SLM, through both automatic metrics and listening tests. We examine five key dimensions: speaking style, intelligibility, speaker consistency, prosodic variation, and spontaneous behaviour. Our results highlight the model's strength in generating varied prosody and spontaneous outputs. It is also rated higher in naturalness and context appropriateness in listening tests compared to a conventional TTS. However, the model's performance in intelligibility and speaker consistency lags behind traditional TTS. Additionally, we show that increasing the scale of SLMs offers a modest boost in robustness. Our findings aim to serve as a benchmark for future advancements in generative SLMs for speech synthesis.
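Two of the automatic metrics mentioned above can be sketched as follows, under assumed formulations: intelligibility as ASR word error rate against the input text, and speaker consistency as cosine similarity between speaker embeddings of two outputs synthesised from the same prompt. `transcribe` and `embed_speaker` are placeholders for an ASR system and a speaker-verification model; the paper's exact metric definitions may differ.

```python
# Sketch of two assumed metric formulations (not the paper's exact setup).
import numpy as np
from jiwer import wer

def intelligibility(input_text: str, synth_wav, transcribe) -> float:
    # Word error rate of an ASR transcript against the TTS input text; lower is better.
    return wer(input_text.lower(), transcribe(synth_wav).lower())

def speaker_consistency(wav_a, wav_b, embed_speaker) -> float:
    # Cosine similarity of speaker embeddings from two syntheses; higher is better.
    ea, eb = embed_speaker(wav_a), embed_speaker(wav_b)
    return float(np.dot(ea, eb) / (np.linalg.norm(ea) * np.linalg.norm(eb)))
```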

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
discrete speech token, generative speech language model, text-to-speech evaluation
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-348777 (URN)2-s2.0-85195990390 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2025-02-07. Bibliographically approved
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, April 14-19, 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
Matcha-TTS: A fast TTS architecture with conditional flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
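The OT-CFM objective referred to above follows the generic conditional flow-matching recipe; a minimal training-step sketch (not the released Matcha-TTS code) is shown below, with `sigma_min` and the tensor shapes as assumptions.

```python
# Sketch of an OT-CFM training step (generic recipe, not the Matcha-TTS release).
# x1 is a target mel-spectrogram batch, cond the text encoding.
import torch

def ot_cfm_loss(decoder, x1, cond, sigma_min=1e-4):
    """decoder(x_t, t, cond) predicts a vector field with the same shape as x1."""
    b = x1.shape[0]
    t = torch.rand(b, device=x1.device).view(b, *([1] * (x1.dim() - 1)))
    x0 = torch.randn_like(x1)                       # noise sample
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1   # point on the OT conditional path
    u_t = x1 - (1 - sigma_min) * x0                 # target vector field (constant in t)
    v_t = decoder(x_t, t.flatten(), cond)
    return torch.mean((v_t - u_t) ** 2)
```

At inference, the decoder is used as an ODE vector field and integrated from noise to a spectrogram in a small number of steps, which is where the speed advantage over score-matching-trained decoders comes from.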

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN)10.1109/ICASSP48485.2024.10448291 (DOI)001396233804117 ()2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, April 14-19, 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16. Created: 2024-07-16. Last updated: 2025-03-26. Bibliographically approved
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare conventional deterministic duration modelling against durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
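Complementing the OT-CFM training objective sketched under the Matcha-TTS entry above, the sketch below shows the inference-time difference that matters here: a regression model returns one fixed duration vector, whereas a flow-matching duration model is sampled by Euler-integrating a learned vector field from noise, so repeated calls give different but plausible timings. `duration_field` is a stand-in for the trained network, and the log-duration parameterisation is an assumption.

```python
# Sketch of sampling phone durations from a flow-matching duration model.
# `duration_field(x, t, text_enc)` is a hypothetical stand-in for the trained network.
import torch

def sample_durations(duration_field, text_enc, n_phones, n_steps=10):
    x = torch.randn(1, n_phones)                     # start from noise
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * duration_field(x, t, text_enc)  # Euler step along the learned flow
    return torch.clamp(x.exp().round(), min=1)       # assumed log-durations -> frame counts
```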

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358878 (URN)10.21437/Interspeech.2024-1582 (DOI)2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-02-25. Bibliographically approved
Lameris, H., Székely, É. & Gustafsson, J. (2024). The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 16058-16065). European Language Resources Association (ELRA)
The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 16058-16065. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advancements in spontaneous text-to-speech (TTS) have enabled the realistic synthesis of creaky voice, a voice quality known for its diverse pragmatic and paralinguistic functions. In this study, we used synthesized creaky voice in perceptual tests to explore how listeners without formal training perceive two distinct types of creaky voice. We annotated a spontaneous speech corpus using creaky voice detection tools and modified a neural TTS engine with a creaky phonation embedding to control the presence of creaky phonation in the synthesized speech. We performed an objective analysis using a creak detection tool, which revealed significant differences in creaky phonation levels between the two creaky voice types and modal voice. Two subjective listening experiments were performed to investigate the effect of creaky voice on perceived certainty, valence, sarcasm, and turn finality. Participants rated non-positional creak as less certain, less positive, and more indicative of turn finality, while positional creak was rated significantly more turn-final compared to modal phonation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
creaky voice, speech perception, speech synthesis, voice quality
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-348782 (URN)2-s2.0-85195915140 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

QC 20240701

Part of ISBN 978-249381410-4

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2024-07-01. Bibliographically approved
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024). Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Unified speech and gesture synthesis using flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 8220-8224. Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN)10.1109/ICASSP48485.2024.10445998 (DOI)001396233801103 ()2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02. Created: 2025-04-02. Last updated: 2025-04-09. Bibliographically approved
O'Mahony, J., Lai, C. & Székely, É. (2024). "Well", what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker "well" with found data and speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 4084-4088). International Speech Communication Association
"Well", what can you do with messy data? Exploring the prosody and pragmatic function of the discourse marker "well" with found data and speech synthesis
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 4084-4088. Conference paper, Published paper (Refereed)
Abstract [en]

Recently, there has been growing interest in the synthesis of conversational speech prosody. Conversational prosody is variable and carries many pragmatic functions. As speech synthesis research moves to using large amounts of untranscribed data, it is crucial that we understand the subtle pragmatic differences prosody can make. This study focuses on discourse markers, which are linguistic elements that perform various communicative functions, with their specific roles often linked to their prosodic realisation. In this paper, we explore the prosodic realisation of "well" using an unlabelled corpus of conversational speech. We use clustering to explore the variation in its prosodic realisation and identify common patterns in a data-driven manner. We synthesise the cluster centroids using controllable speech synthesis. Finally, we evaluate how the prosodic realisation of "well" affects the meaning of an utterance.
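A minimal sketch of the data-driven clustering step described above, under assumed choices of features and cluster count: standardise a few prosodic features per token of "well", run k-means, and keep the centroids as targets for controllable synthesis. The feature set and k are assumptions, not the paper's exact setup.

```python
# Sketch of clustering prosodic realisations of "well" (assumed feature set and k).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_well_tokens(features: np.ndarray, n_clusters: int = 4):
    """features: (n_tokens, n_feats), e.g. [mean F0, F0 slope, duration, energy, pause after]."""
    scaler = StandardScaler().fit(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(scaler.transform(features))
    centroids = scaler.inverse_transform(km.cluster_centers_)  # back to interpretable units
    return km.labels_, centroids                                # centroids -> targets for TTS control
```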

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conversational speech synthesis, pragmatics, prosody
National Category
General Language Studies and Linguistics; Natural Language Processing; Computer Sciences; Specific Languages
Identifiers
urn:nbn:se:kth:diva-358879 (URN)10.21437/Interspeech.2024-2122 (DOI)2-s2.0-85214836302 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-27. Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-1175-840X
