Spontaneous Neural HMM TTS with Prosodic Feature Modification
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1643-1054
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
2022 (English) In: Proceedings of Fonetik 2022, 2022. Conference paper, Published paper (Other academic)
Abstract [en]

Spontaneous speech synthesis is a complex enterprise, as the data has large variation, as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.

Place, publisher, year, edition, pages
2022.
National Category
Other Computer and Information Science
Specific Languages
Identifiers
URN: urn:nbn:se:kth:diva-313156
OAI: oai:DiVA.org:kth-313156
DiVA, id: diva2:1662177
Conference
Fonetik 2022, Stockholm, 13-15 May 2022
Funder
Swedish Research Council, 2019-05003
Note

QC 20220726

Available from: 2022-05-31 Created: 2022-05-31 Last updated: 2024-03-15 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Lameris, Harm; Mehta, Shivam; Henter, Gustav Eje; Kirkland, Ambika; Moëll, Birger; O'Regan, Jim; Gustafsson, Joakim; Székely, Éva
