kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Prosody-Controllable Spontaneous TTS with Neural HMMs
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.ORCID iD: 0000-0001-9537-8505
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Robotics, Perception and Learning, RPL.ORCID iD: 0000-0002-1643-1054
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.ORCID iD: 0000-0002-0397-6442
Show others and affiliations
2023 (English)In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE) , 2023Conference paper, Published paper (Refereed)
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech make the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system’s capability of synthesizing two types of creaky voice.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE) , 2023.
Keywords [en]
Speech Synthesis, Prosodic Control, NeuralHMM, Spontaneous speech, Creaky voice
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:kth:diva-327893DOI: 10.1109/ICASSP49357.2023.10097200Scopus ID: 2-s2.0-85174033357OAI: oai:DiVA.org:kth-327893DiVA, id: diva2:1761460
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Funder
Swedish Research Council, VR-2019-05003Swedish Research Council, VR-2020-02396Riksbankens Jubileumsfond, P20-0298Knut and Alice Wallenberg Foundation, WASP
Note

QC 20230602

Available from: 2023-06-01 Created: 2023-06-01 Last updated: 2024-08-28Bibliographically approved

Open Access in DiVA

fulltext(1134 kB)461 downloads
File information
File name FULLTEXT01.pdfFile size 1134 kBChecksum SHA-512
12db14d1b95aa139758dc9d8a1f33ba44641171cbfc9e5fd4fb069fe3b656891244d3a04f368bfd062cf808cee26fcddf7a14de31401542e68ebf20dc6e4c06e
Type fulltextMimetype application/pdf

Other links

Publisher's full textScopushttps://ieeexplore.ieee.org/document/10097200

Authority records

Lameris, HarmMehta, ShivamHenter, Gustav EjeGustafsson, JoakimSzékely, Éva

Search in DiVA

By author/editor
Lameris, HarmMehta, ShivamHenter, Gustav EjeGustafsson, JoakimSzékely, Éva
By organisation
Speech, Music and Hearing, TMHRobotics, Perception and Learning, RPL
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar
Total: 463 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 247 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf