2022 (English) In: Proceedings of Fonetik 2022, 2022. Conference paper, Published paper (Other academic)
Abstract [en]
Spontaneous speech synthesis is a complex enterprise, as the data has large variation, as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
National Category
Other Computer and Information Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-313156 (URN)
Conference
Fonetik 2022, Stockholm, 13-15 May 2022
Funder
Swedish Research Council, 2019-05003
Note
QC 20220726
2022-05-31 2022-05-31 2024-03-15 Bibliographically approved