Publications (8 of 8)
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 4279-4283). International Speech Communication Association.
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4279-4283. Conference paper, published paper (refereed).
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
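
The "exact maximum likelihood" claim rests on the change-of-variables identity that normalising flows are built on. As a sketch in generic flow notation (an invertible post-net $f$ mapping acoustic frames $\mathbf{x}$ to latents $\mathbf{z} = f(\mathbf{x})$ whose distribution $p_Z$ the neural HMM describes; these symbols are ours, not the paper's):

$$
\log p_X(\mathbf{x}) = \log p_Z\bigl(f(\mathbf{x})\bigr) + \log\left|\det \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}\right|
$$

Both terms are available in closed form for flows such as Glow, so durations and acoustics can be optimised jointly by maximising this log-likelihood directly, with no variational lower bound or adversarial objective.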

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-338584 (URN)
10.21437/Interspeech.2023-1996 (DOI)
001186650304087 ()
2-s2.0-85167953412 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-11-07 Created: 2023-11-07 Last updated: 2025-02-07. Bibliographically approved.
Kirkland, A., Gustafsson, J. & Székely, É. (2023). Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 5217-5221). International Speech Communication Association.
Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 5217-5221. Conference paper, published paper (refereed).
Abstract [en]

Disfluencies are a hallmark of spontaneous speech and play an important role in conversation, yet have been shown to negatively impact judgments about speakers. We explored the role of disfluencies in the perception of competence, sincerity and confidence in public speaking contexts, using synthesized spontaneous speech. In one experiment, listeners rated 30-40-second clips which varied in terms of whether they contained filled pauses, as well as the number and types of repetition. Both the overall number of disfluencies and the repetition type had an impact on competence and confidence, and disfluent speech was also rated as less sincere. In the second experiment, the negative effects of repetition type on competence were attenuated when participants attributed disfluency to anxiety.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
disfluencies, public speaking, speech perception, speech synthesis, spontaneous speech
National subject category
Comparative Language Studies and General Linguistics
Identifiers
urn:nbn:se:kth:diva-337835 (URN)
10.21437/Interspeech.2023-887 (DOI)
001186650305074 ()
2-s2.0-85171528981 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2024-10-14. Bibliographically approved.
Lameris, H., Kirkland, A., Gustafsson, J. & Székely, É. (2023). Situating speech synthesis: Investigating contextual factors in the evaluation of conversational TTS. In: Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023. Paper presented at 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023 (pp. 69-74). International Speech Communication Association.
Situating speech synthesis: Investigating contextual factors in the evaluation of conversational TTS
2023 (English). In: Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023, International Speech Communication Association, 2023, pp. 69-74. Conference paper, published paper (refereed).
Abstract [en]

Speech synthesis evaluation methods have lagged behind the development of TTS systems, with single-sentence read-speech MOS naturalness evaluation on crowdsourcing platforms being the industry standard. For TTS to be successfully applied in social contexts, evaluation methods need to be socially embedded in the situation where they will be deployed. Due to the time and cost constraints of conducting an in-person interaction evaluation for TTS, we examine the effect of introducing situational context and preceding sentence context to participants in a subjective listening experiment. We conduct a suitability evaluation for a robot game guide that explains game rules to participants, using two synthesized spontaneous voices: an instruction-specific and a general spontaneous voice. Results indicate that the inclusion of context influences user ratings, highlighting the need for context-aware evaluations. However, the type of context did not significantly affect the results.
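
As a point of reference for the criticised standard protocol, here is a minimal Python sketch of what a single-sentence MOS naturalness evaluation reduces to: the mean of 1-5 ratings with a normal-approximation 95% confidence interval. The ratings below are invented for illustration; real crowdsourced evaluations add listener screening, many stimuli and many raters.

```python
# MOS over hypothetical 1-5 naturalness ratings, with a
# normal-approximation 95% confidence interval.
import math

ratings = [4, 5, 3, 4, 4, 5, 3, 4, 2, 4]  # made-up listener scores

n = len(ratings)
mos = sum(ratings) / n
var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
ci95 = 1.96 * math.sqrt(var / n)                      # normal approximation

print(f"MOS = {mos:.2f} +/- {ci95:.2f} (95% CI, n={n})")
```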

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
speech synthesis, text to speech, evaluation, social, context
National subject category
Language Processing and Computational Linguistics
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-365096 (URN)
10.21437/SSW.2023-11 (DOI)
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023
Project
VR-2019-05003, VR-2020-02396, P20-0298
Research funder
Vetenskapsrådet, 2019-05003; Vetenskapsrådet, 2020-02396; Riksbankens Jubileumsfond, P20-0298
Note

QC 20250701

Available from: 2025-06-18 Created: 2025-06-18 Last updated: 2025-07-01. Bibliographically approved.
Moell, B., O'Regan, J., Mehta, S., Kirkland, A., Lameris, H., Gustafsson, J. & Beskow, J. (2022). Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge. In: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser (Ed.), The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments. Paper presented at 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022 (pp. 62-70). Marseille, France
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
2022 (English). In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, pp. 62-70. Conference paper, published paper (refereed).
Abstract [en]

As part of the PSST challenge, we explore how data augmentation, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually transcribed speech from non-aphasic speakers (TIMIT) improves performance when room impulse response is used to augment the data. The best-performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% over the baseline model on the primary outcome measure. We show that data augmentation, larger model size, and additional non-aphasic data sources can help improve automatic phoneme recognition models for people with aphasia.
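
For readers unfamiliar with the metrics, the sketch below shows how phoneme error rate is conventionally computed: Levenshtein edit distance between hypothesis and reference phoneme sequences, divided by the reference length. The PSST challenge's official scoring, and its feature error rate (which compares phonological features rather than whole phonemes), may differ in detail; the transcriptions here are hypothetical.

```python
# Conventional phoneme error rate (PER): edit distance / reference length.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Dynamic-programming Levenshtein distance over phoneme symbols."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(dp[j] + 1,        # deletion
                      dp[j - 1] + 1,    # insertion
                      prev + (r != h))  # substitution (or match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def per(ref: list[str], hyp: list[str]) -> float:
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical ARPAbet-style transcriptions of "the cat":
reference  = ["DH", "AH", "K", "AE", "T"]
hypothesis = ["DH", "AH", "K", "AA", "T"]
print(f"PER = {per(reference, hypothesis):.1%}")  # one substitution in five: 20.0%
```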

Place, publisher, year, edition, pages
Marseille, France, 2022
Keywords
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
National subject category
Other Electrical Engineering and Electronics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314262 (URN)
2-s2.0-85145876107 (Scopus ID)
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, June 25, 2022
Note

QC 20220815

Available from: 2022-06-17 Created: 2022-06-17 Last updated: 2023-08-14. Bibliographically approved.
Lameris, H., Mehta, S., Henter, G. E., Kirkland, A., Moëll, B., O'Regan, J., . . . Székely, É. (2022). Spontaneous Neural HMM TTS with Prosodic Feature Modification. In: Proceedings of Fonetik 2022. Paper presented at Fonetik 2022, Stockholm, 13-15 May, 2022.
Spontaneous Neural HMM TTS with Prosodic Feature Modification
2022 (English). In: Proceedings of Fonetik 2022, 2022. Conference paper, published paper (Other academic).
Abstract [en]

Spontaneous speech synthesis is a complex enterprise, as the data has large variation as well as speech disfluencies normally omitted from read speech. These disfluencies perturb the attention mechanism present in most text-to-speech (TTS) systems. Explicit modelling of prosodic features has enabled intuitive prosody modification of synthesized speech. Most prosody-controlled TTS, however, has been trained on read-speech data that is not representative of spontaneous conversational prosody. The diversity in prosody in spontaneous speech data allows for more wide-ranging data-driven modelling of prosodic features. Additionally, prosody-controlled TTS requires extensive training data and GPU time, which limits accessibility. We use neural HMM TTS as it reduces the parameter size and can achieve fast convergence with stable alignments for spontaneous speech data. We modify neural HMM TTS to enable prosodic control of the speech rate and fundamental frequency. We perform subjective evaluation of the generated speech of English and Swedish TTS models and objective evaluation for English TTS. The subjective evaluation showed a significant improvement in naturalness for Swedish for the mean prosody compared to a baseline with no prosody modification, and the objective evaluation showed greater variety in the mean of the per-utterance prosodic features.
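
The kind of control described here can be pictured with a small sketch: scale the predicted per-phone durations to change speech rate, scale the f0 contour to change pitch, then pass the modified features to the vocoder. All names and values below are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical prosodic-feature modification before vocoding.
import numpy as np

def modify_prosody(durations_frames: np.ndarray,
                   f0_hz: np.ndarray,
                   rate_scale: float = 1.0,
                   f0_scale: float = 1.0):
    """Speed up/slow down by scaling durations; raise/lower pitch
    multiplicatively (a constant shift in the log-f0 domain)."""
    new_dur = np.maximum(1, np.round(durations_frames / rate_scale)).astype(int)
    voiced = f0_hz > 0                      # leave unvoiced frames (f0 = 0) at zero
    new_f0 = np.where(voiced, f0_hz * f0_scale, 0.0)
    return new_dur, new_f0

# Example: 10% slower speech, pitch raised two semitones (factor 2**(2/12)).
dur = np.array([7, 12, 9, 15])              # frames per phone (made up)
f0 = np.array([0.0, 110.0, 121.0, 0.0])     # mean f0 per phone in Hz (made up)
print(modify_prosody(dur, f0, rate_scale=0.9, f0_scale=2 ** (2 / 12)))
```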

National subject category
Other Computer and Information Science; Specific Languages
Identifiers
urn:nbn:se:kth:diva-313156 (URN)
Conference
Fonetik 2022, Stockholm, 13-15 May, 2022
Research funder
Vetenskapsrådet, 2019-05003
Note

QC 20220726

Available from: 2022-05-31 Created: 2022-05-31 Last updated: 2024-03-15. Bibliographically approved.
Ward, N., Kirkland, A., Wlodarczak, M. & Székely, É. (2022). Two Pragmatic Functions of Breathy Voice in American English Conversation. In: Sónia Frota, Marisa Cruz and Marina Vigário (Ed.), Proceedings 11th International Conference on Speech Prosody. Paper presented at 11th International Conference on Speech Prosody, Lisbon, Portugal, May 23-26, 2022 (pp. 82-86). International Speech Communication Association.
Two Pragmatic Functions of Breathy Voice in American English Conversation
2022 (English). In: Proceedings 11th International Conference on Speech Prosody / [ed] Sónia Frota, Marisa Cruz and Marina Vigário, International Speech Communication Association, 2022, pp. 82-86. Conference paper, oral presentation with published abstract (refereed).
Abstract [en]

Although the paralinguistic and phonological significance of breathy voice is well known, its pragmatic roles have been little studied. We report a systematic exploration of the pragmatic functions of breathy voice in American English, using a small corpus of casual conversations, using the Cepstral Peak Prominence Smoothed measure as an indicator of breathy voice, and using a common workflow to find prosodic constructions and identify their meanings. We found two prosodic constructions involving breathy voice. The first involves a short region of breathy voice in the midst of a region of low pitch, functioning to mark self-directed speech. The second involves breathy voice over several seconds, combined with a moment of wider pitch range leading to a high pitch over about a second, functioning to mark an attempt to establish common ground. These interpretations were confirmed by a perception experiment.
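
For orientation, below is a rough single-frame sketch of cepstral peak prominence (CPP), the quantity whose smoothed variant (CPPS) the study uses; lower values are commonly read as breathier voice. The search band, windowing and regression baseline are illustrative choices, not the paper's exact settings.

```python
# Single-frame cepstral peak prominence: height of the cepstral pitch
# peak above a linear regression baseline over the searched quefrencies.
import numpy as np

def cpp(frame: np.ndarray, sr: int,
        f0_min: float = 60.0, f0_max: float = 330.0) -> float:
    frame = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)             # real cepstrum
    quefrency = np.arange(len(cepstrum)) / sr    # in seconds

    # Look for the pitch peak among plausible pitch periods.
    band = (quefrency >= 1.0 / f0_max) & (quefrency <= 1.0 / f0_min)
    idx = np.flatnonzero(band)[np.argmax(cepstrum[band])]

    # CPP = peak height above the regression line at the peak quefrency.
    slope, intercept = np.polyfit(quefrency[band], cepstrum[band], 1)
    return cepstrum[idx] - (slope * quefrency[idx] + intercept)

# Example: a harmonic-rich 200 Hz signal gives a clear cepstral peak.
sr = 16000
t = np.arange(1024) / sr
frame = np.sign(np.sin(2 * np.pi * 200 * t))     # crude pulse train
print(f"CPP ~ {cpp(frame, sr):.3f}")
```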

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Keywords
CPPS, voice quality, self-directed speech, common ground, grounding, explaining, prosodic constructions
National subject category
Language Processing and Computational Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-313391 (URN)
10.21437/SpeechProsody.2022-17 (DOI)
2-s2.0-85165879874 (Scopus ID)
Conference
11th International Conference on Speech Prosody, Lisbon, Portugal, May 23-26, 2022
Project
Perception of speaker stance – using spontaneous speech synthesis to explore the contribution of prosody, context and speaker (VR-2020-02396)
Prosodic functions of voice quality dynamics (VR-2019-02932)
CAPTivating – Comparative Analysis of Public speaking with Text-to-speech (P20-0298)
Research funder
Vetenskapsrådet, VR-2020-02396; Vetenskapsrådet, VR-2019-02932; Riksbankens Jubileumsfond, P20-0298
Note

QC 20220628

Available from: 2022-06-03 Created: 2022-06-03 Last updated: 2025-02-07. Bibliographically approved.
Kirkland, A., Lameris, H., Székely, É. & Gustafsson, J. (2022). Where's the uh, hesitation?: The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence. In: INTERSPEECH 2022. Paper presented at Interspeech Conference, September 18-22, 2022, Incheon, South Korea (pp. 4990-4994). International Speech Communication Association.
Where's the uh, hesitation?: The interplay between filled pause location, speech rate and fundamental frequency in perception of confidence
2022 (English). In: INTERSPEECH 2022, International Speech Communication Association, 2022, pp. 4990-4994. Conference paper, published paper (refereed).
Abstract [en]

Much of the research investigating the perception of speaker certainty has relied on either attempting to elicit prosodic features in read speech, or artificial manipulation of recorded audio. Our novel method of controlling prosody in synthesized spontaneous speech provides a powerful tool for studying speech perception and can provide better insight into the interacting effects of prosodic features on perception while also paving the way for conversational systems which are more effectively able to engage in and respond to social behaviors. Here we have used this method to examine the combined impact of filled pause location, speech rate and f0 on the perception of speaker confidence. We found an additive effect of all three features. The most confident-sounding utterances had no filler, low f0 and high speech rate, while the least confident-sounding utterances had a medial filled pause, high f0 and low speech rate. Insertion of filled pauses had the strongest influence, but pitch and speaking rate could be used to more finely control the uncertainty cues in spontaneous speech synthesis.

Place, publisher, year, edition, pages
International Speech Communication Association, 2022
Series
Interspeech, ISSN 2308-457X
Keywords
speech synthesis, speech perception, expressive speech synthesis, paralinguistics
National subject category
Comparative Language Studies and General Linguistics
Identifiers
urn:nbn:se:kth:diva-324862 (URN)
10.21437/Interspeech.2022-10973 (DOI)
000900724505034 ()
2-s2.0-85140084915 (Scopus ID)
Conference
Interspeech Conference, September 18-22, 2022, Incheon, South Korea
Note

QC 20230322

Available from: 2023-03-22 Created: 2023-03-22 Last updated: 2023-03-22. Bibliographically approved.
Kirkland, A., Włodarczak, M., Gustafsson, J. & Székely, É. (2021). Perception of smiling voice in spontaneous speech synthesis. In: Proceedings of Speech Synthesis Workshop (SSW11). Paper presented at Speech Synthesis Workshop (SSW11), Budapest, Hungary, August 26-28, 2021 (pp. 108-112). International Speech Communication Association.
Perception of smiling voice in spontaneous speech synthesis
2021 (English). In: Proceedings of Speech Synthesis Workshop (SSW11), International Speech Communication Association, 2021, pp. 108-112. Conference paper, published paper (refereed).
Abstract [en]

Smiling during speech production has been shown to result in perceptible acoustic differences compared to non-smiling speech. However, there is a scarcity of research on the perception of “smiling voice” in synthesized spontaneous speech. In this study, we used a sequence-to-sequence neural text-to-speech system built on conversational data to produce utterances with the characteristics of spontaneous speech. Segments of speech following laughter, and the same utterances not preceded by laughter, were compared in a perceptual experiment, after removing laughter and/or breaths from the beginning of the utterance, to determine whether participants perceive the utterances preceded by laughter as sounding as if they were produced while smiling. The results showed that participants identified the post-laughter speech as smiling at a rate significantly greater than chance. Furthermore, the effect of content (positive/neutral/negative) was investigated. These results show that laughter, a spontaneous, non-elicited phenomenon in our model’s training data, can be used to synthesize expressive speech with the perceptual characteristics of smiling.

Place, publisher, year, edition, pages
International Speech Communication Association, 2021
Keywords
speech synthesis, text-to-speech, smiling voice, smiled speech
National subject category
Language Processing and Computational Linguistics
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-329143 (URN)
10.21437/SSW.2021-19 (DOI)
Conference
Speech Synthesis Workshop (SSW11), Budapest, Hungary, August 26-28, 2021
Research funder
Vetenskapsrådet, VR-2020-02396; Vetenskapsrådet, VR-2019-05003; Riksbankens Jubileumsfond, P20-0298
Note

QC 20230616

Available from: 2023-06-15 Created: 2023-06-15 Last updated: 2025-02-07. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0003-0292-1164
