kth.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.ORCID iD: 0000-0002-1643-1054
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.ORCID iD: 0000-0002-0397-6442
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.ORCID iD: 0000-0003-1175-840X
2023 (English)In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE) , 2023Conference paper, Published paper (Refereed)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the later of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE) , 2023.
Keywords [en]
self-supervised speech representation, speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:kth:diva-335090DOI: 10.1109/ICASSPW59220.2023.10193157ISI: 001046933700056Scopus ID: 2-s2.0-85165623363OAI: oai:DiVA.org:kth-335090DiVA, id: diva2:1793234
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023
Note

Part of ISBN 9798350302615

QC 20230831

Available from: 2023-08-31 Created: 2023-08-31 Last updated: 2025-02-07Bibliographically approved

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full textScopus

Authority records

Wang, SiyangHenter, Gustav EjeGustafsson, JoakimSzékely, Éva

Search in DiVA

By author/editor
Wang, SiyangHenter, Gustav EjeGustafsson, JoakimSzékely, Éva
By organisation
Speech, Music and Hearing, TMH
Natural Language Processing

Search outside of DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetric score

doi
urn-nbn
Total: 168 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf