A comparative study of self-supervised speech representations in read and spontaneous TTS
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-1643-1054
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0002-0397-6442
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-1175-840X
2023 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr tts

Place, publisher, year, edition, pages
2023.
Keywords [en]
speech synthesis, self-supervised speech representation, spontaneous speech
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
URN: urn:nbn:se:kth:diva-328741
ISBN: 979-8-3503-0261-5 (print)
OAI: oai:DiVA.org:kth-328741
DiVA, id: diva2:1765841
Conference
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
Projects
Digital Futures project Advanced Adaptive Intelligent Systems (AAIS)
Swedish Research Council project Connected (VR-2019-05003)
Swedish Research Council project Perception of speaker stance (VR-2020-02396)
Riksbankens Jubileumsfond project CAPTivating (P20-0298)
Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation
Note

Accepted by the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece

QC 20230620

Available from: 2023-06-12 Created: 2023-06-12 Last updated: 2025-02-18 Bibliographically approved

Authority records

Wang, Siyang; Henter, Gustav Eje; Gustafsson, Joakim; Székely, Éva
