kth.se Publications
Gustafsson, Joakim, Professor (ORCID iD: orcid.org/0000-0002-0397-6442)
Publications (10 of 157)
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS. In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings. Paper presented at 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023. Institute of Electrical and Electronics Engineers (IEEE)
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
2023 (English). In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings. Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

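For readers who want to try the idea, the sketch below shows one plausible way (not the authors' exact pipeline) to extract a single wav2vec2.0 layer as the intermediate representation for two-stage TTS, using the HuggingFace transformers library. The model name and layer index follow the paper's best-performing setting; the function itself is illustrative.

```python
import torch
from transformers import Wav2Vec2Model

# ASR-finetuned 12-layer base model, matching the paper's best setting.
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

def extract_ssl_features(waveform_16khz: torch.Tensor, layer: int = 9) -> torch.Tensor:
    """Return frame-level features from one transformer layer.

    waveform_16khz: (batch, samples) float tensor sampled at 16 kHz.
    hidden_states[0] is the CNN front-end output, so hidden_states[layer]
    is the output of transformer layer `layer`.
    """
    with torch.no_grad():
        out = model(waveform_16khz, output_hidden_states=True)
    return out.hidden_states[layer]  # (batch, frames, 768)

# Example: 1 second of audio yields roughly 50 frames of 768-dim features,
# which a two-stage TTS would predict in place of mel-spectrogram frames.
features = extract_ssl_features(torch.randn(1, 16000))
print(features.shape)
```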
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
self-supervised speech representation, speech synthesis, spontaneous speech
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-335090 (URN), 10.1109/ICASSPW59220.2023.10193157 (DOI), 001046933700056 (), 2-s2.0-85165623363 (Scopus ID)
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023
Note

Part of ISBN 9798350302615

QC 20230831

Available from: 2023-08-31 Created: 2023-08-31 Last updated: 2023-09-26. Bibliographically approved
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A comparative study of self-supervised speech representations in read and spontaneous TTS. Paper presented at the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece.
A comparative study of self-supervised speech representations in read and spontaneous TTS
2023 (English). Manuscript (preprint) (Other academic)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims at addressing these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms other tested SSLs and mel-spectrogram, in both read and spontaneous TTS. Our work sheds light on both how speech SSL can readily improve current TTS systems, and how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

Keywords
speech synthesis, self-supervised speech representation, spontaneous speech
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Interaction Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328741 (URN), 979-8-3503-0261-5 (ISBN)
Conference
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
Projects
Digital Futures project Advanced Adaptive Intelligent Systems (AAIS); Swedish Research Council project Connected (VR-2019-05003); Swedish Research Council project Perception of speaker stance (VR-2020-02396); Riksbankens Jubileumsfond project CAPTivating (P20-0298); Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation
Note

Accepted by the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece

QC 20230620

Available from: 2023-06-12 Created: 2023-06-12 Last updated: 2023-06-20. Bibliographically approved
Peña, P. R., Doyle, P. R., Ip, E. Y., Di Liberto, G., Higgins, D., McDonnell, R., . . . Cowan, B. R. (2023). A Special Interest Group on Developing Theories of Language Use in Interaction with Conversational User Interfaces. In: CHI 2023: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. Paper presented at 2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, Hamburg, Germany, Apr 23 2023 - Apr 28 2023. Association for Computing Machinery (ACM), Article ID 509.
A Special Interest Group on Developing Theories of Language Use in Interaction with Conversational User Interfaces
2023 (English). In: CHI 2023: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery (ACM), 2023, article id 509. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
conversational user interfaces, human-machine dialogue, psycholinguistic models, speech agents
National Category
Human Computer Interaction; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-333352 (URN), 10.1145/3544549.3583179 (DOI), 2-s2.0-85153273115 (Scopus ID)
Conference
2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, Hamburg, Germany, Apr 23 2023 - Apr 28 2023
Note

Part of ISBN 9781450394222

QC 20230801

Available from: 2023-08-01 Created: 2023-08-01 Last updated: 2023-08-01. Bibliographically approved
Ekstedt, E., Wang, S., Székely, É., Gustafsson, J. & Skantze, G. (2023). Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 5481-5485). International Speech Communication Association
Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis
2023 (English). In: Interspeech 2023. International Speech Communication Association, 2023, p. 5481-5485. Conference paper, Published paper (Refereed)
Abstract [en]

Turn-taking is a fundamental aspect of human communication where speakers convey their intention to either hold, or yield, their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial and two open-source Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli, or controlling the prosody, we analyze the models' performance. We show that while commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS systems trained on read or spontaneous speech produce strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.

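The Voice Activity Projection (VAP) model is published separately by the authors; the sketch below only illustrates, under assumed inputs, how per-frame future-speaker probabilities from such a model could be turned into hold/yield scores at a synthesized turn end. The `p_future_speaker` input and the window size are hypothetical, not the paper's API.

```python
import numpy as np

def turn_taking_scores(p_future_speaker: np.ndarray, turn_end_frame: int,
                       window: int = 50) -> dict:
    """Score hold vs. yield cues at the end of a synthesized turn.

    p_future_speaker: (frames,) hypothetical VAP-style probability that the
    TTS speaker is the active speaker in the near future. A clear turn-hold
    cue keeps this probability high after the turn end; a clear turn-yield
    cue lets it drop toward zero.
    """
    after = p_future_speaker[turn_end_frame:turn_end_frame + window]
    hold = float(after.mean())
    return {"hold": hold, "yield": 1.0 - hold}

# Toy example: a falling probability contour reads as a yield cue.
probs = np.linspace(0.9, 0.1, 100)
print(turn_taking_scores(probs, turn_end_frame=50))
```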
Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
human-computer interaction, text-to-speech, turn-taking
National Category
Language Technology (Computational Linguistics); Computer Sciences; General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-337873 (URN), 10.21437/Interspeech.2023-2064 (DOI), 2-s2.0-85171597862 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231010

Available from: 2023-10-10 Created: 2023-10-10 Last updated: 2023-10-10. Bibliographically approved
Lameris, H., Gustafsson, J. & Székely, É. (2023). Beyond style: synthesizing speech with pragmatic functions. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 3382-3386). International Speech Communication Association
Beyond style: synthesizing speech with pragmatic functions
2023 (English). In: Interspeech 2023. International Speech Communication Association, 2023, p. 3382-3386. Conference paper, Published paper (Refereed)
Abstract [en]

With recent advances in generative modelling, conversational systems are becoming more lifelike and capable of long, nuanced interactions. Text-to-Speech (TTS) is being tested in territories requiring natural-sounding speech that can mimic the complexities of human conversation. Hyper-realistic speech generation has been achieved, but a gap remains between the verbal behavior required for upscaled conversation, such as paralinguistic information and pragmatic functions, and comprehension of the acoustic prosodic correlates underlying these. Without this knowledge, reproducing these functions in speech has little value. We use prosodic correlates including spectral peaks, spectral tilt, and creak percentage for speech synthesis with the pragmatic functions of small talk, self-directed speech, advice, and instructions. We perform a MOS evaluation, and a suitability experiment in which our system outperforms a read-speech and conversational baseline.

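As an illustration of one of the prosodic correlates named in the abstract, the snippet below estimates utterance-level spectral tilt as the slope of a line fit to the average log-magnitude spectrum. This is a common textbook formulation, not necessarily the paper's exact feature extraction; creak percentage would require a dedicated creak detector, which is beyond this sketch.

```python
import numpy as np
import librosa

def spectral_tilt(path: str) -> float:
    """Slope (dB per kHz) of a line fit to the average log-magnitude spectrum."""
    y, sr = librosa.load(path, sr=16000)
    S = np.abs(librosa.stft(y, n_fft=1024))           # (freq_bins, frames)
    mag_db = librosa.amplitude_to_db(S.mean(axis=1))  # average spectrum in dB
    freqs = librosa.fft_frequencies(sr=sr, n_fft=1024)
    slope, _intercept = np.polyfit(freqs / 1000.0, mag_db, deg=1)
    return float(slope)  # more negative = steeper tilt (less vocal effort)

# print(spectral_tilt("utterance.wav"))
```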
Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
conversational TTS, pragmatic functions, speech synthesis
National Category
Language Technology (Computational Linguistics); General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-337836 (URN), 10.21437/Interspeech.2023-2072 (DOI), 2-s2.0-85171537616 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231009

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2023-10-09. Bibliographically approved
Gustafsson, J., Székely, É. & Beskow, J. (2023). Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters. In: 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023). Paper presented at the 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Würzburg, Germany, Jan 5 2023 - Jan 8 2023. Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>Generation of speech and facial animation with controllable articulatory effort for amusing conversational characters
2023 (English). In: 23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Engaging embodied conversational agents need to generate expressive behavior in order to be believable in socializing interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulation effort. In subjective evaluations we compare our conversational TTS system's capability to deliver jokes with a commercial TTS. Both systems succeeded equally well.

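A toy sketch of the control idea described above: per-phoneme articulation targets scaled by an articulatory-effort parameter and a waveform-derived prominence estimate. The viseme table, field names, and scaling formula are hypothetical stand-ins for the paper's animation pipeline.

```python
from dataclasses import dataclass

# Hypothetical base jaw-opening targets per phoneme (0 = closed, 1 = fully open).
BASE_JAW = {"AA": 0.8, "IY": 0.25, "M": 0.0, "S": 0.1}

@dataclass
class Phone:
    symbol: str
    start: float        # seconds, from the TTS alignment
    end: float
    prominence: float   # 0..1, estimated from the synthesized waveform

def jaw_keyframes(phones, effort=1.0):
    """effort < 1 gives hypo-articulation, effort > 1 hyper-articulation."""
    for p in phones:
        target = BASE_JAW.get(p.symbol, 0.3)
        # Prominent phones open wider; everything is scaled by global effort.
        amount = min(1.0, target * effort * (0.5 + 0.5 * p.prominence))
        yield (p.start, p.end, amount)

for kf in jaw_keyframes([Phone("AA", 0.0, 0.12, 0.9),
                         Phone("M", 0.12, 0.20, 0.3)], effort=1.3):
    print(kf)
```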
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Language Technology (Computational Linguistics); Robotics
Identifiers
urn:nbn:se:kth:diva-341039 (URN), 10.1145/3570945.3607289 (DOI), 2-s2.0-85183581153 (Scopus ID)
Conference
23rd ACM International Conference on Intelligent Virtual Agents (IVA 2023), Würzburg, Germany, Jan 5 2023 - Jan 8 2023
Note

Part of ISBN 9798350345445

QC 20231124

Available from: 2023-12-19 Created: 2023-12-19 Last updated: 2024-02-09. Bibliographically approved
Miniotaitė, J., Wang, S., Beskow, J., Gustafson, J., Székely, É. & Abelho Pereira, A. T. (2023). Hi robot, it's not what you say, it's how you say it. In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). Paper presented at the 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Aug 28-31, 2023, Busan, South Korea (pp. 307-314). Institute of Electrical and Electronics Engineers (IEEE)
Hi robot, it's not what you say, it's how you say it
2023 (English). In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). Institute of Electrical and Electronics Engineers (IEEE), 2023, p. 307-314. Conference paper, Published paper (Refereed)
Abstract [en]

Many robots use their voice to communicate with people in spoken language, but the voices commonly used for robots are often optimized for transactional interactions rather than social ones. This can limit their ability to create engaging and natural interactions. To address this issue, we designed a spontaneous text-to-speech tool and used it to author natural and spontaneous robot speech. A crowdsourcing evaluation methodology is proposed to compare this type of speech to natural speech and state-of-the-art text-to-speech technology, both in disembodied and embodied form. We created speech samples in a naturalistic setting of people playing tabletop games and conducted a user study evaluating Naturalness, Intelligibility, Social Impression, Prosody, and Perceived Intelligence. The speech samples were chosen to represent three contexts that are common in tabletop games, and these contexts were introduced to the participants who evaluated the speech samples. The study results show that the proposed evaluation methodology allowed for a robust analysis that successfully compared the different conditions. Moreover, the spontaneous voice met our target design goal of being perceived as more natural than a leading commercial text-to-speech.

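A study of this kind ultimately reduces to comparing mean ratings per condition; the snippet below shows one standard way to report them with 95% confidence intervals. The condition names and ratings are made up for illustration and are not the paper's data.

```python
import numpy as np
from scipy import stats

# Hypothetical 1-5 Naturalness ratings per condition.
ratings = {
    "natural speech":  [5, 4, 5, 4, 5, 4, 4, 5],
    "spontaneous TTS": [4, 4, 3, 5, 4, 4, 3, 4],
    "commercial TTS":  [3, 3, 4, 2, 3, 3, 4, 3],
}

for cond, r in ratings.items():
    r = np.asarray(r, dtype=float)
    # 95% CI from the t distribution, appropriate for small samples.
    lo, hi = stats.t.interval(0.95, len(r) - 1, loc=r.mean(), scale=stats.sem(r))
    print(f"{cond:>15}: mean={r.mean():.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```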
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Series
IEEE RO-MAN, ISSN 1944-9445
Keywords
speech synthesis, human-robot interaction, embodiment, spontaneous speech, intelligibility, naturalness
National Category
Interaction Technologies
Identifiers
urn:nbn:se:kth:diva-341972 (URN), 10.1109/RO-MAN57019.2023.10309427 (DOI), 001108678600044 (), 2-s2.0-85186982397 (Scopus ID)
Conference
32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), Aug 28-31, 2023, Busan, South Korea
Note

Part of proceedings ISBN 979-8-3503-3670-2

Available from: 2024-01-09 Created: 2024-01-09 Last updated: 2024-03-22. Bibliographically approved
Kirkland, A., Gustafsson, J. & Székely, É. (2023). Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 5217-5221). International Speech Communication Association
Pardon my disfluency: The impact of disfluency effects on the perception of speaker competence and confidence
2023 (English). In: Interspeech 2023. International Speech Communication Association, 2023, p. 5217-5221. Conference paper, Published paper (Refereed)
Abstract [en]

Disfluencies are a hallmark of spontaneous speech and play an important role in conversation, yet have been shown to negatively impact judgments about speakers. We explored the role of disfluencies in the perception of competence, sincerity and confidence in public speaking contexts, using synthesized spontaneous speech. In one experiment, listeners rated 30-40-second clips which varied in terms of whether they contained filled pauses, as well as the number and types of repetition. Both the overall number of disfluencies and the repetition type had an impact on competence and confidence, and disfluent speech was also rated as less sincere. In the second experiment, the negative effects of repetition type on competence were attenuated when participants attributed disfluency to anxiety.

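To make the experimental manipulation concrete, the sketch below constructs disfluent variants of a fluent sentence by inserting filled pauses and word repetitions before synthesis. The insertion strategy is deliberately naive and only illustrates the kind of stimulus contrast the study describes; the actual stimuli were controlled far more carefully.

```python
import random

def add_disfluencies(text, n_filled=2, n_repeats=1, seed=0):
    """Insert filled pauses and word repetitions into a fluent sentence."""
    rng = random.Random(seed)
    words = text.split()
    for _ in range(n_repeats):            # word repetition, e.g. "the the"
        i = rng.randrange(len(words))
        words.insert(i, words[i])
    for _ in range(n_filled):             # filled pauses
        i = rng.randrange(len(words) + 1)
        words.insert(i, rng.choice(["uh", "um"]))
    return " ".join(words)

print(add_disfluencies("thank you all for coming to my talk today"))
```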
Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
disfluencies, public speaking, speech perception, speech synthesis, spontaneous speech
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-337835 (URN), 10.21437/Interspeech.2023-887 (DOI), 2-s2.0-85171528981 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231009

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2023-10-09. Bibliographically approved
Székely, É., Gustafsson, J. & Torre, I. (2023). Prosody-controllable gender-ambiguous speech synthesis: a tool for investigating implicit bias in speech perception. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023 (pp. 1234-1238). International Speech Communication Association
Prosody-controllable gender-ambiguous speech synthesis: a tool for investigating implicit bias in speech perception
2023 (English). In: Interspeech 2023. International Speech Communication Association, 2023, p. 1234-1238. Conference paper, Published paper (Refereed)
Abstract [en]

This paper proposes a novel method to develop gender-ambiguous TTS, which can be used to investigate hidden gender bias in speech perception. Our aim is to provide a tool for researchers to conduct experiments on language use associated with specific genders. Ambiguous voices can also be beneficial for virtual assistants, to help reduce stereotypes and increase acceptance. Our approach uses a multi-speaker embedding in a neural TTS engine, combining two corpora recorded by a male and a female speaker to achieve a gender-ambiguous timbre. We also propose speaker-disentangled prosody control to ensure that the timbre is robust across a range of prosodies and enable more expressive speech. We optimised the output using an SSL-based network trained on hundreds of speakers. We conducted perceptual evaluations on the settings that were judged most ambiguous by the network, which showed that listeners perceived the speech samples as gender-ambiguous, also in prosody-controlled conditions.

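The core mechanism can be sketched as interpolation between two speaker embeddings, with an external ambiguity score used to pick the mixing weight. Here `ambiguity` stands in for the SSL-based network mentioned in the abstract; both functions are assumptions, not the paper's implementation.

```python
import numpy as np

def mixed_speaker_embedding(emb_male, emb_female, alpha):
    """alpha = 0 gives the male timbre, alpha = 1 the female timbre."""
    return (1.0 - alpha) * emb_male + alpha * emb_female

def most_ambiguous_alpha(emb_m, emb_f, ambiguity, steps=21):
    """ambiguity(embedding) -> score in [0, 1]; e.g. how close a gender
    classifier's output is to 0.5 for speech synthesized with that embedding."""
    alphas = np.linspace(0.0, 1.0, steps)
    scores = [ambiguity(mixed_speaker_embedding(emb_m, emb_f, a)) for a in alphas]
    return float(alphas[int(np.argmax(scores))])
```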
Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
gender bias, human-computer interaction, speech synthesis
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-337832 (URN), 10.21437/Interspeech.2023-2086 (DOI), 2-s2.0-85171582438 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), Dublin, Ireland, Aug 20 2023 - Aug 24 2023
Note

QC 20231009

Available from: 2023-10-09 Created: 2023-10-09 Last updated: 2023-10-09. Bibliographically approved
Lameris, H., Mehta, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). Prosody-Controllable Spontaneous TTS with Neural HMMs. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP): . Paper presented at International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE)
Prosody-Controllable Spontaneous TTS with Neural HMMs
2023 (English). In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice.

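Utterance-level prosody control of the kind described is often implemented by broadcasting a small control vector across the encoder's output frames; the PyTorch sketch below shows that general conditioning pattern under assumed dimensions, not the paper's actual neural HMM architecture.

```python
import torch
import torch.nn as nn

class ProsodyConditionedEncoder(nn.Module):
    def __init__(self, phone_dim=256, prosody_dim=3, cond_dim=32):
        super().__init__()
        self.prosody_proj = nn.Linear(prosody_dim, cond_dim)

    def forward(self, encoder_out: torch.Tensor,
                prosody: torch.Tensor) -> torch.Tensor:
        """encoder_out: (batch, frames, phone_dim)
        prosody: (batch, prosody_dim), one control vector per utterance,
        e.g. target pitch and speaking-rate values."""
        cond = torch.tanh(self.prosody_proj(prosody))       # (batch, cond_dim)
        cond = cond.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        # Every frame sees the same utterance-level control vector.
        return torch.cat([encoder_out, cond], dim=-1)

enc = ProsodyConditionedEncoder()
out = enc(torch.randn(2, 17, 256), torch.randn(2, 3))
print(out.shape)  # torch.Size([2, 17, 288])
```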
Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
Speech Synthesis, Prosodic Control, Neural HMM, Spontaneous speech, Creaky voice
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-327893 (URN), 10.1109/ICASSP49357.2023.10097200 (DOI)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Funder
Swedish Research Council, VR-2019-05003; Swedish Research Council, VR-2020-02396; Riksbankens Jubileumsfond, P20-0298; Knut and Alice Wallenberg Foundation, WASP
Note

QC 20230602

Available from: 2023-06-01 Created: 2023-06-01 Last updated: 2023-06-02. Bibliographically approved