Publications (6 of 6)
Lameris, H., Gustafsson, J. & Székely, É. (2023). Beyond style: synthesizing speech with pragmatic functions. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20-24, 2023 (pp. 3382-3386). International Speech Communication Association
Beyond style: synthesizing speech with pragmatic functions
2023 (English) In: Interspeech 2023, International Speech Communication Association, 2023, pp. 3382-3386. Conference paper, Published paper (Refereed)
Abstract [en]

With recent advances in generative modelling, conversational systems are becoming more lifelike and capable of long, nuanced interactions. Text-to-Speech (TTS) is being tested in territories requiring natural-sounding speech that can mimic the complexities of human conversation. Hyper-realistic speech generation has been achieved, but a gap remains between the verbal behavior required for scaled-up conversation, such as paralinguistic information and pragmatic functions, and an understanding of the acoustic-prosodic correlates underlying them. Without this knowledge, reproducing these functions in speech has little value. We use prosodic correlates including spectral peaks, spectral tilt, and creak percentage for speech synthesis with the pragmatic functions of small talk, self-directed speech, advice, and instructions. We perform a MOS evaluation and a suitability experiment in which our system outperforms a read-speech and a conversational baseline.
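
The correlates named in the abstract can each be measured from the signal. As a rough illustration, the sketch below estimates spectral tilt as the slope of a straight line fit to the log-magnitude spectrum of one analysis frame; this is one common operationalisation, not necessarily the paper's exact feature pipeline, and the function name and band limits are assumptions of ours.

```python
# Minimal sketch: spectral tilt of one frame as the slope of a regression
# line through the log-magnitude spectrum. Band limits are illustrative.
import numpy as np
from scipy.stats import linregress

def spectral_tilt(frame: np.ndarray, sr: int) -> float:
    """Spectral tilt in dB per octave of one windowed frame (50 Hz-5 kHz)."""
    windowed = frame * np.hanning(len(frame))
    magnitude = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = (freqs >= 50) & (freqs <= 5000)            # skip DC, keep speech band
    log_f = np.log2(freqs[band])                      # frequency in octaves
    log_mag = 20 * np.log10(magnitude[band] + 1e-10)  # magnitude in dB
    slope, *_ = linregress(log_f, log_mag)
    return slope  # more negative = steeper tilt (softer/breathier voicing)

# Toy usage: a 25 ms frame of a synthetic two-harmonic signal at 16 kHz.
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(f"tilt: {spectral_tilt(frame, sr):.1f} dB/octave")
```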

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
conversational TTS, pragmatic functions, speech synthesis
National Category
Language Technology (Computational Linguistics); General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-337836 (URN); 10.21437/Interspeech.2023-2072 (DOI); 2-s2.0-85171537616 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20-24, 2023
Note

QC 20231009

Available from: 2023-10-09. Created: 2023-10-09. Last updated: 2023-10-09. Bibliographically approved.
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20-24, 2023 (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English) In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
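
The "exact maximum likelihood" training mentioned above rests on the change-of-variables identity that any invertible flow satisfies. A minimal statement is below; the notation (f for the flow, p_Z for the base distribution) is ours, not the paper's:

```latex
% Change of variables for an invertible map f with tractable Jacobian:
% the exact log-likelihood of an observation x decomposes into the base
% density of its latent image plus a log-determinant correction.
\log p_X(x) = \log p_Z\bigl(f(x)\bigr)
            + \log \left| \det \frac{\partial f(x)}{\partial x} \right|
```

Stacking such maps, as in Glow-style invertible post-nets, keeps both terms tractable, which is what lets a combined duration-and-acoustics model be trained by exact maximum likelihood rather than a lower bound.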

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-338584 (URN); 10.21437/Interspeech.2023-1996 (DOI); 2-s2.0-85167953412 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20-24, 2023
Note

QC 20231107

Available from: 2023-11-07. Created: 2023-11-07. Last updated: 2023-11-07. Bibliographically approved.
Lameris, H., Mehta, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). Prosody-Controllable Spontaneous TTS with Neural HMMs. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Paper presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE)
Prosody-Controllable Spontaneous TTS with Neural HMMs
2023 (English) In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system’s capability of synthesizing two types of creaky voice.
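
As an illustration of what utterance-level prosody control can look like architecturally, the sketch below broadcasts a small control vector onto every encoder timestep before decoding. All names and dimensions are assumptions of ours; this is a generic conditioning pattern, not the paper's implementation.

```python
# Hedged sketch: condition a text encoder on an utterance-level prosody
# vector, e.g. [mean log-f0, speech rate], broadcast over all timesteps.
import torch
import torch.nn as nn

class ProsodyConditionedEncoder(nn.Module):
    def __init__(self, vocab_size=64, text_dim=128, prosody_dim=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim)
        # Project concatenated text + control features back to text_dim.
        self.proj = nn.Linear(text_dim + prosody_dim, text_dim)

    def forward(self, phoneme_ids, prosody):
        # phoneme_ids: (B, T); prosody: (B, prosody_dim)
        x = self.embed(phoneme_ids)                       # (B, T, text_dim)
        ctrl = prosody.unsqueeze(1).expand(-1, x.size(1), -1)
        return torch.tanh(self.proj(torch.cat([x, ctrl], dim=-1)))

# At synthesis time the controls can be set directly, e.g. to request
# lower pitch and slower speech than the training-data mean:
enc = ProsodyConditionedEncoder()
ids = torch.randint(0, 64, (1, 10))
out = enc(ids, torch.tensor([[-0.5, -1.0]]))  # standardised control values
print(out.shape)  # torch.Size([1, 10, 128])
```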

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
Speech Synthesis, Prosodic Control, NeuralHMM, Spontaneous speech, Creaky voice
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-327893 (URN); 10.1109/ICASSP49357.2023.10097200 (DOI)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Funder
Swedish Research Council, VR-2019-05003; Swedish Research Council, VR-2020-02396; Riksbankens Jubileumsfond, P20-0298; Knut and Alice Wallenberg Foundation, WASP
Note

QC 20230602

Available from: 2023-06-01. Created: 2023-06-01. Last updated: 2023-06-02. Bibliographically approved.
Moell, B., O'Regan, J., Mehta, S., Kirkland, A., Lameris, H., Gustafsson, J. & Beskow, J. (2022). Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge. In: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser (Ed.), The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments. Paper presented at 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022 (pp. 62-70). Marseille, France
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
2022 (English) In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, pp. 62-70. Conference paper, Published paper (Refereed)
Abstract [en]

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when room impulse response is used to augment the data. The best-performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.
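
For a concrete sense of the two augmentations named above, the sketch below applies a pitch shift and a room-impulse-response convolution with off-the-shelf tools. The file names are placeholders, and the paper's actual augmentation pipeline may differ.

```python
# Hedged sketch of pitch-shift and RIR augmentation; "utterance.wav" and
# "rir.npy" are placeholder inputs, not files from the PSST data.
import librosa
import numpy as np
from scipy.signal import fftconvolve

y, sr = librosa.load("utterance.wav", sr=16000)

# Pitch shift by +2 semitones; in practice the shift amount is often
# sampled from a small range so each epoch sees a different variant.
y_shift = librosa.effects.pitch_shift(y, sr=sr, n_steps=2.0)

# Room-impulse-response augmentation: convolve with a normalised RIR to
# simulate recording the same utterance in a reverberant room.
rir = np.load("rir.npy")
y_reverb = fftconvolve(y, rir / np.abs(rir).max(), mode="full")[: len(y)]
```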

Place, publisher, year, edition, pages
Marseille, France, 2022
Keywords
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314262 (URN); 2-s2.0-85145876107 (Scopus ID)
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022
Note

QC 20220815

Available from: 2022-06-17. Created: 2022-06-17. Last updated: 2023-08-14. Bibliographically approved.
Lameris, H., Mehta, S., Henter, G. E., Kirkland, A., Moëll, B., O'Regan, J., . . . Székely, É. (2022). Spontaneous Neural HMM TTS with Prosodic Feature Modification. In: Proceedings of Fonetik 2022. Paper presented at Fonetik 2022, Stockholm, 13-15 May, 2022.
Spontaneous Neural HMM TTS with Prosodic Feature Modification
2022 (English) In: Proceedings of Fonetik 2022, 2022. Conference paper, Published paper (Other academic)
Abstract [en]

Spontaneous speech synthesis is a complex enterprise, as the data has large variation as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most Text-to-Speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity of prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
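
One simple way to obtain per-utterance control values of the kind described, where zero requests the corpus-mean prosody used as the reference condition above, is to standardise each utterance's speech rate and mean log-f0 against corpus statistics. The sketch below is one plausible realisation under that assumption, not the paper's code; all names are illustrative.

```python
# Hedged sketch: standardised per-utterance prosody controls, so a control
# value of 0.0 corresponds to the corpus-mean prosody.
import numpy as np

def prosody_controls(utt_rate, utt_log_f0, corpus_rates, corpus_log_f0):
    """z-score one utterance's speech rate (e.g. syllables/s) and mean
    log-f0 against whole-corpus statistics."""
    rate_z = (utt_rate - np.mean(corpus_rates)) / np.std(corpus_rates)
    f0_z = (utt_log_f0 - np.mean(corpus_log_f0)) / np.std(corpus_log_f0)
    return np.array([rate_z, f0_z])

# At synthesis time, [0.0, 0.0] requests mean prosody; [1.0, 0.0] requests
# speech roughly one standard deviation faster than the corpus mean.
```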

National Category
Other Computer and Information Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-313156 (URN)
Conference
Fonetik 2022, Stockholm, 13-15 May, 2022
Funder
Swedish Research Council, 2019-05003
Note

QC 20220726

Available from: 2022-05-31. Created: 2022-05-31. Last updated: 2024-03-15. Bibliographically approved.
Kirkland, A., Lameris, H., Székely, É. & Gustafsson, J. (2022). Where's the uh, hesitation?: The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence. In: INTERSPEECH 2022. Paper presented at the Interspeech Conference, Sep 18-22, 2022, Incheon, South Korea (pp. 4990-4994). International Speech Communication Association
Where's the uh, hesitation?: The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence
2022 (English) In: INTERSPEECH 2022, International Speech Communication Association, 2022, pp. 4990-4994. Conference paper, Published paper (Refereed)
Abstract [en]

Much of the research investigating the perception of speaker certainty has relied either on attempting to elicit prosodic features in read speech or on artificially manipulating recorded audio. Our novel method of controlling prosody in synthesized spontaneous speech provides a powerful tool for studying speech perception: it can give better insight into the interacting effects of prosodic features on perception while also paving the way for conversational systems that can more effectively engage in and respond to social behaviors. Here we have used this method to examine the combined impact of filled pause location, speech rate and f0 on the perception of speaker confidence. We found an additive effect of all three features. The most confident-sounding utterances had no filler, low f0 and high speech rate, while the least confident-sounding utterances had a medial filled pause, high f0 and low speech rate. Insertion of filled pauses had the strongest influence, but pitch and speaking rate could be used to more finely control the uncertainty cues in spontaneous speech synthesis.
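
Studying the combined impact of three features implies a fully crossed stimulus design. The sketch below enumerates such a design; the factor levels are illustrative assumptions, since only none/medial pause placement and high/low rate and f0 are explicitly mentioned in the text.

```python
# Hedged sketch of a fully crossed (factorial) stimulus design for the
# three features above. Level names are illustrative, not the paper's.
from itertools import product

pause_location = ["none", "medial"]
speech_rate = ["low", "high"]
f0_level = ["low", "high"]

conditions = list(product(pause_location, speech_rate, f0_level))
print(len(conditions))  # 8 crossed conditions per utterance text
# Per the reported result, ("none", "high", "low") should sound most
# confident and ("medial", "low", "high") least confident.
```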

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Series
Interspeech, ISSN 2308-457X
Keywords
speech synthesis, speech perception, expressive speech synthesis, paralinguistics
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-324862 (URN); 10.21437/Interspeech.2022-10973 (DOI); 000900724505034 (ISI); 2-s2.0-85140084915 (Scopus ID)
Conference
Interspeech Conference, Sep 18-22, 2022, Incheon, South Korea
Note

QC 20230322

Available from: 2023-03-22. Created: 2023-03-22. Last updated: 2023-03-22. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0001-9537-8505
