Publications (10 of 12)
Lameris, H., Gustafsson, J. & Székely, É. (2025). VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, The Netherlands, August 17-21, 2025 (pp. 2295-2299). International Speech Communication Association
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 2295-2299. Conference paper, Published paper (Refereed)
Abstract [en]

Voice quality is an often overlooked aspect of speech with many communicative functions. Voice quality conveys both paralinguistic and pragmatic information, such as signalling speaker stance and aiding in grounding. In this paper, we present VoiceQualityVC, a tool that can manipulate the voice quality of both natural and synthesized speech using voice quality features including CPPS, H1-H2, and H1-A3. VoiceQualityVC is a research tool for perceptual experiments into voice quality and UX experiments for voice design. We perform an objective evaluation demonstrating the control of these features as well as subjective listening tests of the paralinguistic attributes of intimacy, valence, and investment. In these listening tests breathy voice was rated as more intimate and more invested than modal voice, and creaky voice was rated as less intimate and less positive. The code and models can be found at https://github.com/Hfkml/VQVC.
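The H1-H2 feature named in the abstract has a standard acoustic definition: the amplitude difference (in dB) between the first and second harmonics of the voice spectrum. As an illustration only (this is a minimal NumPy sketch under that textbook definition, not code from the VQVC repository, and the function name and band-search heuristic are invented here), it can be estimated from a voiced frame and a known f0 like this:

```python
import numpy as np

def h1_h2(frame, sr, f0, search_hz=50.0):
    """Estimate H1-H2 in dB: amplitude of the first harmonic minus the second.

    frame: mono samples of one voiced frame; sr: sample rate in Hz;
    f0: fundamental frequency of the frame in Hz.
    """
    # Window to reduce spectral leakage, then take the magnitude spectrum.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)

    def peak_db(target_hz):
        # Strongest spectral peak within +/- search_hz of the target harmonic.
        band = (freqs > target_hz - search_hz) & (freqs < target_hz + search_hz)
        return 20.0 * np.log10(spec[band].max() + 1e-12)

    return peak_db(f0) - peak_db(2.0 * f0)
```

On a synthetic tone with a 200 Hz harmonic at amplitude 1.0 and a 400 Hz harmonic at 0.5, this returns roughly 20·log10(2) ≈ 6 dB, matching the amplitude ratio of the two harmonics.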

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Paralinguistics, Pragmatics, Voice conversion, Voice quality
HSV category
Identifiers
urn:nbn:se:kth:diva-372784 (URN)
10.21437/Interspeech.2025-902 (DOI)
2-s2.0-105020036268 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, The Netherlands, August 17-21, 2025
Note

QC 20251120

Available from: 2025-11-20. Created: 2025-11-20. Last updated: 2025-11-20. Bibliographically checked.
Lameris, H., Gustafsson, J. & Székely, É. (2024). CreakVC: A Voice Conversion Tool for Modulating Creaky Voice. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1005-1006). International Speech Communication Association
CreakVC: A Voice Conversion Tool for Modulating Creaky Voice
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 1005-1006. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce a human-in-the-loop one-shot voice conversion tool called CreakVC designed to modulate the level of creaky voice in the converted speech. Creaky voice, often used by speakers to convey sociolinguistic cues, presents challenges to speech processing due to its complex phonation characteristics. The primary goal of CreakVC is to enable in-depth research into how these cues are perceived, using systematic perceptual studies. CreakVC provides access to a diverse range of voice identities exhibiting creaky voice, while maintaining consistency in other parameters. We developed a spectrogram-frame level creak representation using CreaPy and finetuned FreeVC, a one-shot voice conversion tool, by conditioning the speaker embedding and the self-supervised audio representation with the creak representation. An integrated plotting feature allows users to visualize and manipulate portions of speech for precise adjustments of creaky phonation levels. Beyond research, CreakVC has potential applications in voice-interactive systems and multimedia production.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
creaky voice, TTS, voice conversion
HSV category
Identifiers
urn:nbn:se:kth:diva-358875 (URN)
2-s2.0-85214828772 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-24. Bibliographically checked.
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
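The contrast the abstract draws, between a regression model that emits identical timings on every rendition and a probabilistic model that samples fresh durations each time, can be shown with a toy sketch. This is purely illustrative: the per-phone means and standard deviations below are invented, and a Gaussian sampler stands in for the paper's OT-CFM duration model.

```python
import numpy as np

rng = np.random.default_rng()

# Invented per-phone duration statistics in frames, standing in for a
# learned duration predictor.
MEAN_DUR = {"h": 4.0, "e": 7.0, "l": 5.0, "o": 9.0}
STD_DUR = {"h": 1.0, "e": 2.0, "l": 1.5, "o": 3.0}

def deterministic_durations(phones):
    # Regression-style duration model: always returns the mean, so the
    # utterance is spoken with identical timing every time.
    return [round(MEAN_DUR[p]) for p in phones]

def stochastic_durations(phones):
    # Probabilistic duration model: draws a new duration per phone on
    # each call, mimicking the natural timing variability of speech.
    return [max(1, round(rng.normal(MEAN_DUR[p], STD_DUR[p]))) for p in phones]

phones = ["h", "e", "l", "o"]
print(deterministic_durations(phones))  # identical on every call
print(stochastic_durations(phones))     # differs from call to call
```

The point of the comparison in the paper is that the second behaviour, sampling rather than averaging, matters most when the training data itself is highly variable, as spontaneous speech is.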

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
HSV category
Identifiers
urn:nbn:se:kth:diva-358878 (URN)
10.21437/Interspeech.2024-1582 (DOI)
001331850102086 ()
2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-12-05. Bibliographically checked.
Lameris, H., Székely, É. & Gustafsson, J. (2024). The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings. Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024 (pp. 16058-16065). European Language Resources Association (ELRA)
The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
2024 (English). In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, pp. 16058-16065. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advancements in spontaneous text-to-speech (TTS) have enabled the realistic synthesis of creaky voice, a voice quality known for its diverse pragmatic and paralinguistic functions. In this study, we used synthesized creaky voice in perceptual tests, to explore how listeners without formal training perceive two distinct types of creaky voice. We annotated a spontaneous speech corpus using creaky voice detection tools and modified a neural TTS engine with a creaky phonation embedding to control the presence of creaky phonation in the synthesized speech. We performed an objective analysis using a creak detection tool which revealed significant differences in creaky phonation levels between the two creaky voice types and modal voice. Two subjective listening experiments were performed to investigate the effect of creaky voice on perceived certainty, valence, sarcasm, and turn finality. Participants rated non-positional creak as less certain, less positive, and more indicative of turn finality, while positional creak was rated significantly more turn final compared to modal phonation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
creaky voice, speech perception, speech synthesis, voice quality
HSV category
Identifiers
urn:nbn:se:kth:diva-348782 (URN)
2-s2.0-85195915140 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20-25, 2024
Note

QC 20240701

Part of ISBN 978-249381410-4

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2024-07-01. Bibliographically checked.
Lameris, H., Gustafsson, J. & Székely, É. (2023). Beyond style: synthesizing speech with pragmatic functions. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023 (pp. 3382-3386). International Speech Communication Association
Beyond style: synthesizing speech with pragmatic functions
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 3382-3386. Conference paper, Published paper (Refereed)
Abstract [en]

With recent advances in generative modelling, conversational systems are becoming more lifelike and capable of long, nuanced interactions. Text-to-Speech (TTS) is being tested in territories requiring natural-sounding speech that can mimic the complexities of human conversation. Hyper-realistic speech generation has been achieved, but a gap remains between the verbal behavior required for upscaled conversation, such as paralinguistic information and pragmatic functions, and comprehension of the acoustic prosodic correlates underlying these. Without this knowledge, reproducing these functions in speech has little value. We use prosodic correlates including spectral peaks, spectral tilt, and creak percentage for speech synthesis with the pragmatic functions of small talk, self-directed speech, advice, and instructions. We perform a MOS evaluation, and a suitability experiment in which our system outperforms a read-speech and conversational baseline.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
conversational TTS, pragmatic functions, speech synthesis
HSV category
Identifiers
urn:nbn:se:kth:diva-337836 (URN)
10.21437/Interspeech.2023-2072 (DOI)
001186650303108 ()
2-s2.0-85171537616 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, Dublin, Ireland, August 20-24, 2023
Note

QC 20241011

Available from: 2023-10-09. Created: 2023-10-09. Last updated: 2025-02-01. Bibliographically checked.
Lameris, H., Wlodarczak, M., Gustafsson, J. & Székely, É. (2023). Neural speech synthesis with controllable creaky voice style. In: Radek Skarnitzl; Jan Volín (Ed.), Proceedings of the 20th International Congress of Phonetic Sciences - ICPhS 2023. Paper presented at 20th International Congress of Phonetic Sciences (ICPhS), Prague Congress Center, Czech Republic, August 7–11, 2023 (pp. 3141-3145), Article ID 717.
Neural speech synthesis with controllable creaky voice style
2023 (English). In: Proceedings of the 20th International Congress of Phonetic Sciences - ICPhS 2023 / [ed] Radek Skarnitzl; Jan Volín, 2023, pp. 3141-3145, Article ID 717. Conference paper, Published paper (Refereed)
Abstract [en]

The use of creaky voice, or vocal fry, in speech has been extensively studied for its linguistic, paralinguistic, and sociolinguistic functions. However, much of the existing research on this topic is fragmented and often contradictory. In order to gain a deeper understanding of the communicative functions of creaky voice, we propose the use of comparative perceptual studies with natural-sounding speech synthesis. We present a neural speech synthesizer that produces highly natural-sounding synthetic speech with controllable creaky voice styles. In a subjective listening experiment, speech experts were able to identify the presence and intensity of creaky voice produced by the synthesizer. Our results suggest that neural speech synthesis can be a valuable tool in furthering our understanding of the communicative functions of creaky voice.

Keywords
creaky voice, vocal fry, speech synthesis, TTS, spontaneous speech
HSV category
Identifiers
urn:nbn:se:kth:diva-364768 (URN)
Conference
20th International Congress of Phonetic Sciences (ICPhS), Prague Congress Center, Czech Republic, August 7–11, 2023
Projects
Connected (VR-2019-05003); STANCE (VR-2020-02396); Prosodic functions of voice quality dynamics (VR-2019-02932); CAPTivating (P20-0298)
Research funder
Swedish Research Council, 2019-02932
Note

QC 20250616

Available from: 2025-06-16. Created: 2025-06-16. Last updated: 2025-06-16. Bibliographically checked.
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at 24th International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-338584 (URN)
10.21437/Interspeech.2023-1996 (DOI)
001186650304087 ()
2-s2.0-85167953412 (Scopus ID)
Conference
24th International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-11-07. Created: 2023-11-07. Last updated: 2025-08-13. Bibliographically checked.
Lameris, H., Mehta, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). Prosody-Controllable Spontaneous TTS with Neural HMMs. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Paper presented at International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes Island, Greece, June 4-10, 2023. Institute of Electrical and Electronics Engineers (IEEE)
Prosody-Controllable Spontaneous TTS with Neural HMMs
2023 (English). In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech make the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system’s capability of synthesizing two types of creaky voice.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
Speech Synthesis, Prosodic Control, NeuralHMM, Spontaneous speech, Creaky voice
HSV category
Identifiers
urn:nbn:se:kth:diva-327893 (URN)
10.1109/ICASSP49357.2023.10097200 (DOI)
2-s2.0-86000382699 (Scopus ID)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes Island, Greece, June 4-10, 2023
Research funder
Swedish Research Council, VR-2019-05003; Swedish Research Council, VR-2020-02396; Riksbankens Jubileumsfond, P20-0298; Knut and Alice Wallenberg Foundation, WASP
Note

Part of ISBN 9781728163277

QC 20250623

Available from: 2023-06-01. Created: 2023-06-01. Last updated: 2025-06-23. Bibliographically checked.
Lameris, H., Kirkland, A., Gustafsson, J. & Székely, É. (2023). Situating speech synthesis: Investigating contextual factors in the evaluation of conversational TTS. In: Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023. Paper presented at 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023 (pp. 69-74). International Speech Communication Association
Situating speech synthesis: Investigating contextual factors in the evaluation of conversational TTS
2023 (English). In: Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023, International Speech Communication Association, 2023, pp. 69-74. Conference paper, Published paper (Refereed)
Abstract [en]

Speech synthesis evaluation methods have lagged behind the development of TTS systems, with single sentence read-speech MOS naturalness evaluation on crowdsourcing platforms being the industry standard. For TTS to successfully be applied in social contexts, evaluation methods need to be socially embedded in the situation where they will be deployed. Due to the time and cost constraints of conducting an in-person interaction evaluation for TTS, we examine the effect of introducing situational context and preceding sentence context to participants in a subjective listening experiment. We conduct a suitability evaluation for a robot game guide that explains game rules to participants using two synthesized spontaneous voices: an instruction-specific and a general spontaneous voice. Results indicate that the inclusion of context influences user ratings, highlighting the need for context-aware evaluations. However, the type of context did not significantly affect the results.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
speech synthesis, text to speech, evaluation, social, context
HSV category
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-365096 (URN)
10.21437/SSW.2023-11 (DOI)
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023
Projects
VR-2019-05003; VR-2020-02396; P20-0298
Research funder
Swedish Research Council, 2019-05003; Swedish Research Council, 2020-02396; Riksbankens Jubileumsfond, P20-0298
Note

QC 20250701

Available from: 2025-06-18. Created: 2025-06-18. Last updated: 2025-07-01. Bibliographically checked.
Moell, B., O'Regan, J., Mehta, S., Kirkland, A., Lameris, H., Gustafsson, J. & Beskow, J. (2022). Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge. In: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser (Ed.), The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments. Paper presented at 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, Jun 25 2022 (pp. 62-70). Marseille, France
Speech Data Augmentation for Improving Phoneme Transcriptions of Aphasic Speech Using Wav2Vec 2.0 for the PSST Challenge
2022 (English). In: The RaPID4 Workshop: Resources and ProcessIng of linguistic, para-linguistic and extra-linguistic Data from people with various forms of cognitive/psychiatric/developmental impairments / [ed] Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser, Marseille, France, 2022, pp. 62-70. Conference paper, Published paper (Refereed)
Abstract [en]

As part of the PSST challenge, we explore how data augmentations, data sources, and model size affect phoneme transcription accuracy on speech produced by individuals with aphasia. We evaluate model performance in terms of feature error rate (FER) and phoneme error rate (PER). We find that data augmentation techniques, such as pitch shift, improve model performance. Additionally, increasing the size of the model decreases FER and PER. Our experiments also show that adding manually-transcribed speech from non-aphasic speakers (TIMIT) improves performance when Room Impulse Response is used to augment the data. The best performing model combines aphasic and non-aphasic data and has a 21.0% PER and a 9.2% FER, a relative improvement of 9.8% compared to the baseline model on the primary outcome measurement. We show that data augmentation, larger model size, and additional non-aphasic data sources can be helpful in improving automatic phoneme recognition models for people with aphasia.

Place, publisher, year, edition, pages
Marseille, France, 2022
Keywords
aphasia, data augmentation, phoneme transcription, phonemes, speech, speech data augmentation, wav2vec 2.0
HSV category
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-314262 (URN)
2-s2.0-85145876107 (Scopus ID)
Conference
4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022, Marseille, France, June 25, 2022
Note

QC 20220815

Available from: 2022-06-17. Created: 2022-06-17. Last updated: 2025-10-17. Bibliographically checked.
Organisations
Identifiers
ORCID iD: orcid.org/0000-0001-9537-8505