2022 (English). In: Proceedings of Fonetik 2022, 2022. Conference paper, Published paper (Other academic)
Abstract [en]
Spontaneous speech synthesis is a complex enterprise, as the data has large variation, as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
National Category
Other Computer and Information Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-313156 (URN)
Conference
Fonetik 2022, Stockholm, 13-15 May 2022
Funder
Vetenskapsrådet, 2019-05003
Note
QC 20220726
2022-05-31, 2022-05-31, 2024-03-15. Bibliographically approved