A comparative study of self-supervised speech representations in read and spontaneous TTS
2023 (English) Manuscript (preprint) (Other academic)
Abstract [en]
Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is, however, unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr tts
Place, publisher, year, edition, pages
2023.
Keywords [en]
speech synthesis, self-supervised speech representation, spontaneous speech
HSV category
Research program
Tal- och musikkommunikation
Identifiers
URN: urn:nbn:se:kth:diva-328741
ISBN: 979-8-3503-0261-5 (print)
OAI: oai:DiVA.org:kth-328741
DiVA, id: diva2:1765841
Conference
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
Projects
Digital Futures project Advanced Adaptive Intelligent Systems (AAIS)
Swedish Research Council project Connected (VR-2019-05003)
Swedish Research Council project Perception of speaker stance (VR-2020-02396)
Riksbankens Jubileumsfond project CAPTivating (P20-0298)
Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation
Note
Accepted by the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
QC 20230620
2023-06-12 2023-06-12 2023-06-20 Bibliographically approved