Publications (6 of 6)
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, published paper (Refereed)
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Project
bodytalk
Note

QC 20241022

Available from: 2024-10-22. Created: 2024-10-22. Last updated: 2024-10-22. Bibliographically reviewed.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024. Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis
2024 (English). In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 1952-1964. Conference paper, published paper (Refereed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
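
To make the data-generation idea described above concrete, here is a minimal, purely illustrative sketch of the pipeline the abstract outlines: uni-modal synthesisers produce synthetic parallel (speech, gesture) pairs from text, which are then used for pre-training before fine-tuning on the small real multimodal corpus. All class and function names are placeholders, not the authors' released code.

# Illustrative sketch of the "fake it to make it" idea: run separately trained
# uni-modal synthesisers over a large text corpus to build synthetic parallel
# (audio, gesture) data, pre-train the joint model on it, then fine-tune on the
# small real multimodal corpus. Everything here is a placeholder stub.
from dataclasses import dataclass

@dataclass
class ParallelExample:
    text: str
    audio: list          # placeholder for a synthesised waveform
    motion: list         # placeholder for a synthesised 3D gesture sequence

def synthesise_speech(text: str) -> list:
    """Stand-in for a uni-modal TTS model trained on a large speech corpus."""
    return []

def synthesise_gesture(text: str) -> list:
    """Stand-in for a uni-modal text-to-gesture model trained on a large motion corpus."""
    return []

def build_synthetic_corpus(texts: list) -> list:
    """Create multi-modal (but synthetic) parallel training data from text alone."""
    return [ParallelExample(t, synthesise_speech(t), synthesise_gesture(t)) for t in texts]

# Pre-train the joint speech-and-gesture model on the synthetic corpus,
# then fine-tune on the (much smaller) real parallel data:
# joint_model.pretrain(build_synthetic_corpus(large_text_corpus))
# joint_model.finetune(real_multimodal_corpus)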

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National subject category
Signal Processing; Natural Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-367174 (URN), 10.1109/CVPRW63382.2024.00201 (DOI), 001327781702011, 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15. Created: 2025-07-15. Last updated: 2025-08-13. Bibliographically reviewed.
Tånnander, C., O'Regan, J., House, D., Edlund, J. & Beskow, J. (2024). Prosodic characteristics of English-accented Swedish neural TTS. In: Proceedings of Speech Prosody 2024. Paper presented at Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024 (pp. 1035-1039). Leiden, The Netherlands: International Speech Communication Association
Prosodic characteristics of English-accented Swedish neural TTS
2024 (English). In: Proceedings of Speech Prosody 2024, Leiden, The Netherlands: International Speech Communication Association, 2024, pp. 1035-1039. Conference paper, published paper (Refereed)
Abstract [en]

Neural text-to-speech synthesis (TTS) captures prosodic features strikingly well, notwithstanding the lack of prosodic labels in training or synthesis. We trained a voice on a single Swedish speaker reading in Swedish and English. The resulting TTS allows us to control the degree of English-accentedness in Swedish sentences. English-accented Swedish commonly exhibits well-known prosodic characteristics such as erroneous tonal accents and understated or missed durational differences. TTS quality was verified in three ways. Automatic speech recognition resulted in low errors, verifying intelligibility. Automatic language classification had Swedish as the majority choice, while the likelihood of English increased with our targeted degree of English-accentedness. Finally, a rank of perceived English-accentedness acquired through pairwise comparisons by 20 human listeners demonstrated a strong correlation with the targeted English-accentedness. We report on phonetic and prosodic analyses of the accented TTS. In addition to the anticipated segmental differences, the analyses revealed temporal and prominence-related variations coherent with Swedish spoken by English speakers, such as missing Swedish stress patterns and overly reduced unstressed syllables. With this work, we aim to glean insights into speech prosody from the latent prosodic features of neural TTS models. In addition, it will help implement speech phenomena such as code switching in TTS.
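
As an aside on the listening-test evaluation described above, a rank over stimuli can be derived from pairwise listener judgements and correlated with the targeted degree of accentedness. The sketch below is illustrative only: the stimulus set, win counts, and the simple win-count scoring are invented stand-ins, not the paper's actual analysis.

# Hypothetical sketch: turning pairwise "which sounds more English-accented?"
# judgements into a rank and correlating it with the targeted accentedness.
# The targets and win counts below are invented for illustration.
import numpy as np
from scipy.stats import spearmanr

# Targeted degree of English-accentedness per stimulus (0 = native Swedish, 1 = fully accented).
targets = np.array([0.0, 0.25, 0.5, 0.75, 1.0])

# wins[i, j] = number of listeners (out of 20) who judged stimulus i as more accented than j.
wins = np.array([
    [0, 2, 1, 0, 0],
    [18, 0, 3, 2, 1],
    [19, 17, 0, 4, 2],
    [20, 18, 16, 0, 5],
    [20, 19, 18, 15, 0],
])

# A simple Borda-style score: total wins per stimulus; the rank follows from sorting.
scores = wins.sum(axis=1)
rho, p = spearmanr(targets, scores)
print(f"Spearman correlation between targeted and perceived accentedness: {rho:.2f} (p={p:.3f})")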

Place, publisher, year, edition, pages
Leiden, The Netherlands: International Speech Communication Association, 2024
Keywords
foreign-accented text-to-speech synthesis, neural text-to-speech synthesis, latent prosodic features
National subject category
Humanities and the Arts; Comparative Language Studies and General Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-349946 (URN), 10.21437/SpeechProsody.2024-209 (DOI), 2-s2.0-105008058763 (Scopus ID)
Conference
Speech Prosody 2024, Leiden, The Netherlands, 2-5 July 2024
Projects
Deep learning based speech synthesis for reading aloud of lengthy and information rich texts in Swedish (2018-02427); Språkbanken Tal (2017-00626)
Research funder
Vinnova, 2018-02427
Note

QC 20240705

Available from: 2024-07-03. Created: 2024-07-03. Last updated: 2025-07-01. Bibliographically reviewed.
O'Regan, J. (2022). Continued finetuning as single speaker adaptation. In: TMH QPSR, Vol. 3. Paper presented at Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference, Stockholm.
Continued finetuning as single speaker adaptation
2022 (English). In: TMH QPSR, Stockholm, 2022, Vol. 3. Conference paper, oral presentation with published abstract (Other academic)
Abstract [en]

The adaptation of unsupervised learning techniques to speech recognition has enabled the training of accurate models with less labelled training data, by finetuning a supervised classifier on top of a network pretrained using self-supervised methods. In this paper, we investigate whether continuing the fine-tuning of such a model is suitable as a method of speaker adaptation for a single speaker, considering two kinds of user: the casual user, with data measurable in minutes, and the professional user, with data measurable in hours. We conduct experiments across a range of dataset sizes, in an attempt to provide a basis for estimates of how much data would be needed.
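
A minimal sketch of what such continued fine-tuning could look like with a wav2vec 2.0 CTC model from the Hugging Face transformers library is given below. The checkpoint name, learning rate, and training loop are assumptions for illustration, not the paper's experimental setup.

# Hypothetical sketch of "continued finetuning" for single-speaker adaptation:
# take an already finetuned wav2vec 2.0 CTC model and keep training it on a
# small amount of one speaker's transcribed audio.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # common practice: keep the CNN feature extractor fixed
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def adaptation_step(waveform, sampling_rate, transcript):
    """One gradient step on a single (audio, text) pair from the target speaker."""
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_values=inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()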

Place, publisher, year, edition, pages
Stockholm, 2022
Keywords
speaker adaptation, finetuning, automatic speech recognition
National subject category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314269 (URN)
Conference
Fonetik 2022 - the XXXIIIrd Swedish Phonetics Conference
Note

QC 20220812

Available from: 2022-06-17. Created: 2022-06-17. Last updated: 2022-08-12. Bibliographically reviewed.
Moell, B., O'Regan, J., Mehta, S., Kirkland, A., Lameris, H., Gustafsson, J. & Beskow, J. (2022). Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge. In: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser (Eds.), The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments. Paper presented at 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022 (pp. 62-70). Marseille, France
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
2022 (English). In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, pp. 62-70. Conference paper, published paper (Refereed)
Abstract [en]

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when room impulse response is used to augment the data. The best performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.
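
For illustration, the two augmentations named above (pitch shift and room impulse response) could be implemented roughly as in the sketch below; the file names and parameter values are placeholders, and the paper's actual augmentation pipeline may differ.

# Hypothetical sketch of pitch-shift and room impulse response (RIR) augmentation.
import librosa
import numpy as np
from scipy.signal import fftconvolve

def pitch_shift_augment(waveform: np.ndarray, sr: int, n_steps: float = 2.0) -> np.ndarray:
    """Shift the pitch of the waveform by n_steps semitones."""
    return librosa.effects.pitch_shift(waveform, sr=sr, n_steps=n_steps)

def rir_augment(waveform: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Simulate a recording environment by convolving with a room impulse response."""
    rir = rir / (np.abs(rir).max() + 1e-8)           # normalise the RIR
    wet = fftconvolve(waveform, rir, mode="full")[: len(waveform)]
    return wet / (np.abs(wet).max() + 1e-8)          # rescale to avoid clipping

# Example usage with placeholder files:
# y, sr = librosa.load("aphasic_utterance.wav", sr=16000)
# rir, _ = librosa.load("room_impulse_response.wav", sr=16000)
# augmented = rir_augment(pitch_shift_augment(y, sr), rir)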

Place, publisher, year, edition, pages
Marseille, France, 2022
Keywords
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
National subject category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314262 (URN), 2-s2.0-85145876107 (Scopus ID)
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022
Note

QC 20220815

Available from: 2022-06-17. Created: 2022-06-17. Last updated: 2025-10-17. Bibliographically reviewed.
Lameris, H., Mehta, S., Henter, G. E., Kirkland, A., Moëll, B., O'Regan, J., . . . Székely, É. (2022). Spontaneous Neural HMM TTS with Prosodic Feature Modification. In: Proceedings of Fonetik 2022. Paper presented at Fonetik 2022, Stockholm, 13-15 May 2022.
Spontaneous Neural HMM TTS with Prosodic Feature Modification
2022 (English). In: Proceedings of Fonetik 2022, 2022. Conference paper, published paper (Other academic)
Abstract [en]

Spontaneous speech synthesis is a complex enterprise, as the data has large variation, as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text to Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
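
As a rough illustration of the kind of prosodic control described above, the sketch below extracts two utterance-level features (mean F0 and speech rate) and scales them before they would be passed to the synthesiser as conditioning values. The feature extraction and the conditioning interface are assumptions, not the authors' neural HMM TTS code.

# Illustrative sketch only: per-utterance prosodic features and their modification.
import librosa
import numpy as np

def utterance_prosody(path: str, n_phones: int) -> dict:
    """Mean F0 (Hz) and speech rate (phones per second) for one utterance."""
    y, sr = librosa.load(path, sr=22050)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    return {
        "mean_f0": float(np.nanmean(f0)),          # average pitch over voiced frames
        "speech_rate": n_phones / (len(y) / sr),   # phones per second
    }

def modified_conditioning(features: dict, f0_scale: float = 1.0, rate_scale: float = 1.0) -> dict:
    """Scale the prosodic conditioning values before synthesis (e.g. 0.9 = slower or lower)."""
    return {
        "mean_f0": features["mean_f0"] * f0_scale,
        "speech_rate": features["speech_rate"] * rate_scale,
    }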

National subject category
Other Computer and Information Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-313156 (URN)
Conference
Fonetik 2022, Stockholm, 13-15 May 2022
Research funder
Vetenskapsrådet, 2019-05003
Note

QC 20220726

Available from: 2022-05-31. Created: 2022-05-31. Last updated: 2024-03-15. Bibliographically reviewed.
Identifiers
ORCID iD: orcid.org/0000-0003-2598-6868
