Publications (7 of 7)
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
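The exact maximum-likelihood training mentioned above follows from the standard change-of-variables formula for normalising flows. As a sketch in generic flow notation (not taken from the paper), an invertible post-net f maps acoustics x to a latent z whose likelihood the neural HMM models:

    \log p_X(\mathbf{x}) = \log p_Z\bigl(f(\mathbf{x})\bigr) + \log \left| \det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}} \right|

Because f is invertible, both terms can be computed exactly, so the combined duration-and-acoustics model can be trained without variational lower bounds or adversarial objectives.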

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-338584 (URN)
10.21437/Interspeech.2023-1996 (DOI)
001186650304087 (ISI)
2-s2.0-85167953412 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland
Kirkland, A., Gustafsson, J. & Székely, É. (2023). Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland (pp. 5217-5221). International Speech Communication Association
Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 5217-5221. Conference paper, Published paper (Refereed)
Abstract [en]

Disfluencies are a hallmark of spontaneous speech and play an important role in conversation, yet have been shown to negatively impact judgments about speakers. We explored the role of disfluencies in the perception of competence, sincerity and confidence in public speaking contexts, using synthesized spontaneous speech. In one experiment, listeners rated 30-40-second clips which varied in terms of whether they contained filled pauses, as well as the number and types of repetition. Both the overall number of disfluencies and the repetition type had an impact on competence and confidence, and disfluent speech was also rated as less sincere. In the second experiment, the negative effects of repetition type on competence were attenuated when participants attributed disfluency to anxiety.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
disfluencies, public speaking, speech perception, speech synthesis, spontaneous speech
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-337835 (URN)
10.21437/Interspeech.2023-887 (DOI)
001186650305074 (ISI)
2-s2.0-85171528981 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland
Moell, B., O'Regan, J., Mehta, S., Kirkland, A., Lameris, H., Gustafsson, J. & Beskow, J. (2022). Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge. In: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser (Eds.), The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments. Paper presented at the 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, June 25, 2022 (pp. 62-70). Marseille, France
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
2022 (English). In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, p. 62-70. Conference paper, Published paper (Refereed)
Abstract [en]

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shifting, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when room impulse responses are used to augment the data. The best-performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.
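As a rough illustration of the two augmentations named above (pitch shifting and room-impulse-response reverberation), the sketch below uses librosa and scipy; the file names, sample rate and shift amount are placeholder assumptions, not details from the paper.

    import numpy as np
    import librosa
    from scipy.signal import fftconvolve

    # Load an utterance (hypothetical file name and sample rate)
    y, sr = librosa.load("utterance.wav", sr=16000)

    # Augmentation 1: shift the pitch up by two semitones
    y_shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)

    # Augmentation 2: convolve with a room impulse response, then
    # rescale so the reverberated copy matches the original peak level
    rir, _ = librosa.load("room_ir.wav", sr=16000)  # hypothetical RIR file
    y_reverb = fftconvolve(y, rir, mode="full")[: len(y)]
    y_reverb *= np.max(np.abs(y)) / np.max(np.abs(y_reverb))

Each augmented copy is then added to the training set alongside the original. As a side note on the reported numbers, a final PER of 21.0% after a 9.8% relative improvement implies a baseline PER of roughly 23.3% (21.0 / (1 - 0.098) ≈ 23.3).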

Place, publisher, year, edition, pages
Marseille, France, 2022
Keywords
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314262 (URN)
2-s2.0-85145876107 (Scopus ID)
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, June 25, 2022
Lameris, H., Mehta, S., Henter, G. E., Kirkland, A., Moëll, B., O'Regan, J., . . . Székely, É. (2022). Spontaneous Neural HMM TTS with Prosodic Feature Modification. In: Proceedings of Fonetik 2022. Paper presented at Fonetik 2022, Stockholm, 13-15 May, 2022.
Spontaneous Neural HMM TTS with Prosodic Feature Modification
2022 (English). In: Proceedings of Fonetik 2022, 2022. Conference paper, Published paper (Other academic)
Abstract [en]

Spontaneous speech synthesis is a complex enterprise, as the data shows large variation and contains speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most text-to-speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. Subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
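A minimal sketch of what per-utterance control of speech rate and fundamental frequency can look like, assuming the acoustic model exposes predicted per-state durations and an f0 contour; all names below are hypothetical illustrations, not the paper's actual interface.

    import numpy as np

    def modify_prosody(durations, f0, rate_scale=1.0, f0_shift_semitones=0.0):
        # Faster speech = proportionally shorter per-state durations
        new_durations = np.maximum(1, np.round(durations / rate_scale)).astype(int)
        # Shift voiced f0 values by a number of semitones; keep unvoiced frames at 0
        new_f0 = np.where(f0 > 0, f0 * 2.0 ** (f0_shift_semitones / 12.0), 0.0)
        return new_durations, new_f0

    # e.g. 20% faster speech, one semitone higher:
    # durations, f0 = modify_prosody(durations, f0, rate_scale=1.2, f0_shift_semitones=1.0)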

National Category
Other Computer and Information Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-313156 (URN)
Conference
Fonetik 2022, Stockholm, 13-15 May, 2022
Funder
Swedish Research Council, 2019-05003
Ward, N., Kirkland, A., Wlodarczak, M. & Székely, É. (2022). Two Pragmatic Functions of Breathy Voice in American English Conversation. In: Sónia Frota, Marisa Cruz and Marina Vigário (Eds.), Proceedings of the 11th International Conference on Speech Prosody. Paper presented at the 11th International Conference on Speech Prosody, Lisbon, Portugal, May 23-26, 2022 (pp. 82-86). International Speech Communication Association
Two Pragmatic Functions of Breathy Voice in American English Conversation
2022 (English). In: Proceedings of the 11th International Conference on Speech Prosody / [ed] Sónia Frota, Marisa Cruz and Marina Vigário, International Speech Communication Association, 2022, p. 82-86. Conference paper, Oral presentation with published abstract (Refereed)
Abstract [en]

Although the paralinguistic and phonological significance of breathy voice is well known, its pragmatic roles have been little studied. We report a systematic exploration of the pragmatic functions of breathy voice in American English, using a small corpus of casual conversations, using the Cepstral Peak Prominence Smoothed measure as an indicator of breathy voice, and using a common workflow to find prosodic constructions and identify their meanings. We found two prosodic constructions involving breathy voice. The first involves a short region of breathy voice in the midst of a region of low pitch, functioning to mark self-directed speech. The second involves breathy voice over several seconds, combined with a moment of wider pitch range leading to a high pitch over about a second, functioning to mark an attempt to establish common ground. These interpretations were confirmed by a perception experiment.
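For readers unfamiliar with the measure, cepstral peak prominence quantifies how strongly the cepstrum peaks at the quefrency of the speaker's pitch period; breathier voice yields a weaker peak. Below is a rough single-frame sketch (the smoothed CPPS variant additionally averages over time and quefrency); the pitch range and the linear-trend fit are common choices, not necessarily the paper's exact settings.

    import numpy as np

    def cepstral_peak_prominence(frame, sr, f0_min=60.0, f0_max=330.0):
        # Real cepstrum: inverse FFT of the log magnitude spectrum
        spectrum = np.fft.rfft(frame * np.hanning(len(frame)))
        cepstrum = np.fft.irfft(np.log(np.abs(spectrum) + 1e-10))
        quefrency = np.arange(len(cepstrum)) / sr
        # Search for the peak in the quefrency range of plausible pitch periods
        lo, hi = int(sr / f0_max), int(sr / f0_min)
        peak = lo + np.argmax(cepstrum[lo:hi])
        # Prominence = peak height above a linear trend fitted to that range
        slope, intercept = np.polyfit(quefrency[lo:hi], cepstrum[lo:hi], 1)
        return cepstrum[peak] - (slope * quefrency[peak] + intercept)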

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Keywords
CPPS, voice quality, self-directed speech, common ground, grounding, explaining, prosodic constructions
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-313391 (URN)
10.21437/SpeechProsody.2022-17 (DOI)
2-s2.0-85165879874 (Scopus ID)
Conference
11th International Conference on Speech Prosody, Lisbon, Portugal, May 23-26, 2022
Projects
Perception of speaker stance – using spontaneous speech synthesis to explore the contribution of prosody, context and speaker (VR-2020-02396)
Prosodic functions of voice quality dynamics (VR-2019-02932)
CAPTivating – Comparative Analysis of Public speaking with Text-to-speech (P20-0298)
Funder
Swedish Research Council, VR-2020-02396
Swedish Research Council, VR-2019-02932
Riksbankens Jubileumsfond, P20-0298
Kirkland, A., Lameris, H., Székely, É. & Gustafsson, J. (2022). Where's the uh, hesitation? The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence. In: INTERSPEECH 2022. Paper presented at Interspeech 2022, September 18-22, 2022, Incheon, South Korea (pp. 4990-4994). International Speech Communication Association
Where's the uh, hesitation? The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence
2022 (English). In: INTERSPEECH 2022, International Speech Communication Association, 2022, p. 4990-4994. Conference paper, Published paper (Refereed)
Abstract [en]

Much of the research investigating the perception of speaker certainty has relied on either attempting to elicit prosodic features in read speech, or artificial manipulation of recorded audio. Our novel method of controlling prosody in synthesized spontaneous speech provides a powerful tool for studying speech perception and can provide better insight into the interacting effects of prosodic features on perception while also paving the way for conversational systems which are more effectively able to engage in and respond to social behaviors. Here we have used this method to examine the combined impact of filled pause location, speech rate and f0 on the perception of speaker confidence. We found an additive effect of all three features. The most confident-sounding utterances had no filler, low f0 and high speech rate, while the least confident-sounding utterances had a medial filled pause, high f0 and low speech rate. Insertion of filled pauses had the strongest influence, but pitch and speaking rate could be used to more finely control the uncertainty cues in spontaneous speech synthesis.
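The additive pattern reported above corresponds, schematically, to a model in which each cue contributes an independent term to the perceived-confidence rating (an illustrative form, not the paper's exact statistical specification):

    \text{confidence} = \beta_0 + \beta_1\,\text{filler location} + \beta_2\,f_0 + \beta_3\,\text{speech rate} + \varepsilon

with the largest-magnitude coefficient attached to filled-pause insertion, consistent with the ranking of cue strengths described in the abstract.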

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Series
Interspeech, ISSN 2308-457X
Keywords
speech synthesis, speech perception, expressive speech synthesis, paralinguistics
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-324862 (URN)
10.21437/Interspeech.2022-10973 (DOI)
000900724505034 (ISI)
2-s2.0-85140084915 (Scopus ID)
Conference
Interspeech 2022, September 18-22, 2022, Incheon, South Korea
Kirkland, A., Włodarczak, M., Gustafsson, J. & Székely, É. (2021). Perception of smiling voice in spontaneous speech synthesis. In: Proceedings of Speech Synthesis Workshop (SSW11). Paper presented at the Speech Synthesis Workshop (SSW11), Budapest, Hungary, August 26-28, 2021 (pp. 108-112). International Speech Communication Association
Perception of smiling voice in spontaneous speech synthesis
2021 (English). In: Proceedings of Speech Synthesis Workshop (SSW11), International Speech Communication Association, 2021, p. 108-112. Conference paper, Published paper (Refereed)
Abstract [en]

Smiling during speech production has been shown to result in perceptible acoustic differences compared to non-smiling speech. However, there is a scarcity of research on the perception of “smiling voice” in synthesized spontaneous speech. In this study, we used a sequence-to-sequence neural text-to-speech system built on conversational data to produce utterances with the characteristics of spontaneous speech. Segments of speech following laughter, and the same utterances not preceded by laughter, were compared in a perceptual experiment after removing laughter and/or breaths from the beginning of the utterance to determine whether participants perceive the utterances preceded by laughter as sounding as if they were produced while smiling. The results showed that participants identified the post-laughter speech as smiling at a rate significantly greater than chance. Furthermore, the effect of content (positive/neutral/negative) was investigated. These results show that laughter, a spontaneous, non-elicited phenomenon in our model’s training data, can be used to synthesize expressive speech with the perceptual characteristics of smiling.
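The claim that post-laughter speech was identified as smiling significantly above chance is the kind of result a simple binomial test against p = 0.5 supports; a hedged sketch with invented placeholder counts (not the study's data):

    from scipy.stats import binomtest

    # Hypothetical counts: 'smiling' responses out of total judgements
    result = binomtest(k=370, n=600, p=0.5, alternative="greater")
    print(result.pvalue)  # a small p-value indicates identification above chance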

Place, publisher, year, edition, pages
International Speech Communication Association, 2021
Keywords
speech synthesis, text-to-speech, smiling voice, smiled speech
National Category
Natural Language Processing
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-329143 (URN)
10.21437/SSW.2021-19 (DOI)
Conference
Speech Synthesis Workshop (SSW11), Budapest, Hungary, August 26-28, 2021
Funder
Swedish Research Council, VR-2020-02396
Swedish Research Council, VR-2019-05003
Riksbankens Jubileumsfond, P20-0298
Identifiers
ORCID iD: orcid.org/0000-0003-0292-1164