2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, p. 1035-1039. Conference paper, Published paper (Refereed)
Abstract [en]
Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.
Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National Category
Humanities and the Arts; General Language Studies and Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN)
10.21437/SpeechProsody.2024-209 (DOI)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427)
Språkbanken Tal (2017-00626)
Funder
Vinnova, 2018-02427
Note
QC 20240705
2024-07-03, 2024-07-03, 2024-07-05. Bibliographically approved