Neural HMMs are all you need (for high-quality attention-free TTS)
Mehta, Shivam. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1886-681X
Székely, Éva. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
Beskow, Jonas. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1399-6604
Henter, Gustav Eje. KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1643-1054
2022 (English). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2022, p. 7457-7461. Conference paper, published paper (refereed).
Abstract [en]

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
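To make the core idea concrete, below is a minimal sketch of the exact sequence likelihood under a left-right no-skip HMM, computed with the forward algorithm in log space. This is an illustrative reconstruction rather than the authors' code: the function name and the arrays log_emission and log_stay are hypothetical stand-ins for the per-frame outputs that, in the paper's setup, a neural network produces (and which are additionally autoregressive on previous acoustic frames).

import numpy as np

def neural_hmm_log_likelihood(log_emission, log_stay):
    # Exact log-likelihood of T observed frames under a left-right,
    # no-skip HMM with N states, via the forward algorithm in log space.
    # log_emission: (T, N) array of log p(x_t | state n); hypothetical
    #               stand-in for neural-network outputs.
    # log_stay:     (T, N) array of self-transition log-probabilities;
    #               with no skips, the only alternative is moving to
    #               state n + 1.
    T, N = log_emission.shape
    log_move = np.log1p(-np.exp(log_stay))  # log(1 - p_stay)

    # alpha[n] = log p(x_1..x_t, state_t = n); alignment starts in state 0.
    alpha = np.full(N, -np.inf)
    alpha[0] = log_emission[0, 0]
    for t in range(1, T):
        stay = alpha + log_stay[t - 1]                 # remain in state n
        move = np.full(N, -np.inf)
        move[1:] = alpha[:-1] + log_move[t - 1, :-1]   # advance n-1 -> n
        alpha = np.logaddexp(stay, move) + log_emission[t]

    # Monotonic alignment: the utterance must end in the final state.
    return alpha[-1]

# Example with dummy values: 100 frames, 20 states, 0.8 stay probability.
rng = np.random.default_rng(0)
log_em = rng.normal(size=(100, 20))
log_st = np.log(np.full((100, 20), 0.8))
print(neural_hmm_log_likelihood(log_em, log_st))

Because every quantity in this recursion is differentiable, gradient ascent on the returned value maximises the full sequence likelihood without approximation, which is what removes the need for an attention mechanism and guarantees monotonic alignment.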

Place, publisher, year, edition, pages
IEEE Signal Processing Society, 2022. p. 7457-7461
Series
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ISSN 2379-190X
Keywords [en]
seq2seq, attention, HMMs, duration modelling, acoustic modelling
National Category
Natural Language Processing; Probability Theory and Statistics; Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-312455
DOI: 10.1109/ICASSP43922.2022.9746686
ISI: 000864187907152
Scopus ID: 2-s2.0-85131260082
OAI: oai:DiVA.org:kth-312455
DiVA id: diva2:1659075
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 23-27, 2022, Singapore, Singapore
Funder
Knut and Alice Wallenberg Foundation, WASP
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220601

Available from: 2022-05-18. Created: 2022-05-18. Last updated: 2025-02-01. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus
Final version on arXiv
Demo webpage
Source code

Authority records

Mehta, Shivam; Székely, Éva; Beskow, Jonas; Henter, Gustav Eje
