Matcha-TTS: A fast TTS architecture with conditional flow matching
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1886-681X
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1356-9653
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1399-6604
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 11341-11345. Conference paper, Published paper (Refereed).
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
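For orientation, the following is a minimal sketch of the OT-CFM training objective as it is commonly written in the flow-matching literature; the symbols (x_0 for the noise sample, x_1 for the data sample, sigma_min, and the learned vector field v_t) are conventions from that literature, not notation taken from this record.

    % Conditional (optimal-transport) flow from noise x_0 to data x_1 at time t
    \phi_t(x_0, x_1) = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1
    % Corresponding target vector field (constant in t along each straight path)
    u_t(x_0, x_1) = x_1 - (1 - \sigma_{\min})\, x_0
    % Regression objective for the learned vector field v_t with parameters theta
    \mathcal{L}_{\mathrm{CFM}}(\theta)
      = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q,\; x_0 \sim \mathcal{N}(0, I)}
        \big\| v_t\big(\phi_t(x_0, x_1); \theta\big) - u_t(x_0, x_1) \big\|^2

At synthesis time the learned vector field is integrated from t = 0 to t = 1 with an off-the-shelf ODE solver (e.g. a few Euler steps), starting from Gaussian noise; the comparatively straight OT paths are what allow high output quality in few synthesis steps, as the abstract above claims relative to score-matching-trained models.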

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024. p. 11341-11345
Keywords [en]
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-350551
DOI: 10.1109/ICASSP48485.2024.10448291
ISI: 001396233804117
Scopus ID: 2-s2.0-85195024093
OAI: oai:DiVA.org:kth-350551
DiVA, id: diva2:1884462
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16. Created: 2024-07-16. Last updated: 2025-03-26. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Mehta, Shivam; Tu, Ruibo; Beskow, Jonas; Székely, Éva; Henter, Gustav Eje
