OverFlow: Putting flows on top of neural transducers for better TTS
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1886-681X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-0292-1164
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0001-9537-8505
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1399-6604
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023. p. 4279-4283
Keywords [en]
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-338584
DOI: 10.21437/Interspeech.2023-1996
ISI: 001186650304087
Scopus ID: 2-s2.0-85167953412
OAI: oai:DiVA.org:kth-338584
DiVA, id: diva2:1810297
Conference
24th Annual Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-11-07. Created: 2023-11-07. Last updated: 2025-02-07. Bibliographically approved.

Authority records

Mehta, Shivam; Kirkland, Ambika; Lameris, Harm; Beskow, Jonas; Székely, Éva; Henter, Gustav Eje
