Gustafsson, Joakim, Professor (ORCID iD: orcid.org/0000-0002-0397-6442)
Publications (10 of 164)
Francis, J., Székely, É. & Gustafsson, J. (2024). ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1001-1002). International Speech Communication Association
ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 1001-1002. Conference paper, Published paper (Refereed)
Abstract [en]

Recent developments in generative language modeling and conversational Text-to-Speech present transformative potential for enhancing Augmentative and Alternative Communication (AAC) devices. Practical application of these technologies requires extensive research and testing. To address this, we introduce ConnecTone, a modular platform designed for rapid integration and testing of language generation and speech technology. ConnecTone implements context-sensitive generative text prediction, using conversational context from Automatic Speech Recognition inputs. The system incorporates a neural TTS that supports interpolation between reading and spontaneous conversational styles, along with adjustable prosodic features. These speech characteristics are predicted using Large Language Models, but can be adjusted by users for individual needs. We anticipate ConnecTone will enable us to rapidly evaluate and implement innovations, thereby contributing to faster benefit delivery to AAC users.
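The paper does not publish ConnecTone's code, so the following is only a rough sketch of what context-sensitive generative text prediction can look like: the partner's ASR-transcribed turns and the user's partial input are packed into a prompt, and a generic text-generation model proposes completions. The model choice, prompt format, and function names are illustrative assumptions, not the system's actual design.

```python
# Hypothetical sketch of context-sensitive text prediction for an AAC interface.
# The prompt format and model choice are illustrative, NOT ConnecTone's
# published implementation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

def predict_candidates(asr_context: list[str], partial_input: str, n: int = 3) -> list[str]:
    """Propose n continuations of the user's partial utterance, conditioned on
    the conversation partner's ASR-transcribed turns."""
    prompt = "\n".join(f"Partner: {turn}" for turn in asr_context)
    prompt += f"\nUser: {partial_input}"
    outputs = generator(
        prompt,
        max_new_tokens=20,
        num_return_sequences=n,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    # Strip the prompt and keep only the newly generated continuations.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]

# Example: the AAC user has typed "I would" after the partner asked a question.
print(predict_candidates(["Would you like tea or coffee?"], "I would"))
```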

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
AAC, Human-Computer Interaction, Speech Synthesis, TTS
National Category
Natural Language Processing; Computer Sciences; Human Computer Interaction
Identifiers
urn:nbn:se:kth:diva-358873 (URN), 2-s2.0-85214814511 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-24. Bibliographically approved
Wang, S., Székely, É. & Gustafsson, J. (2024). Contextual Interactive Evaluation of TTS Models in Dialogue Systems. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2965-2969). International Speech Communication Association
Contextual Interactive Evaluation of TTS Models in Dialogue Systems
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 2965-2969. Conference paper, Published paper (Refereed)
Abstract [en]

Evaluation of text-to-speech (TTS) models is currently dominated by Mean Opinion Score (MOS) listening tests, but the validity of MOS has been increasingly questioned. MOS tests place listeners in a passive setup, in which they do not actively interact with the TTS and usually evaluate isolated utterances without context. They therefore give no indication of how well a TTS model suits an interactive application such as a spoken dialogue system, in which the ability to generate appropriate speech in the dialogue context is paramount. We take a first step towards addressing this shortcoming by evaluating several state-of-the-art neural TTS models, including one that adapts to dialogue context, in a custom-built spoken dialogue system. We present the system design, experiment setup, and results. Our work is the first to evaluate TTS in contextual dialogue system interactions. We also discuss the shortcomings and future opportunities of the proposed evaluation paradigm.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
evaluation methodology, human-computer interaction, spoken dialogue system, text-to-speech
National Category
Natural Language Processing; Other Engineering and Technologies
Identifiers
urn:nbn:se:kth:diva-358876 (URN), 10.21437/Interspeech.2024-1008 (DOI), 2-s2.0-85214809755 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-02-13. Bibliographically approved
Lameris, H., Gustafsson, J. & Székely, É. (2024). CreakVC: A Voice Conversion Tool for Modulating Creaky Voice. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 1005-1006). International Speech Communication Association
CreakVC: A Voice Conversion Tool for Modulating Creaky Voice
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, p. 1005-1006. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce CreakVC, a human-in-the-loop one-shot voice conversion tool designed to modulate the level of creaky voice in the converted speech. Creaky voice, often used by speakers to convey sociolinguistic cues, presents challenges to speech processing due to its complex phonation characteristics. The primary goal of CreakVC is to enable in-depth research into how these cues are perceived, using systematic perceptual studies. CreakVC provides access to a diverse range of voice identities exhibiting creaky voice, while maintaining consistency in other parameters. We developed a spectrogram-frame-level creak representation using CreaPy and fine-tuned FreeVC, a one-shot voice conversion tool, by conditioning the speaker embedding and the self-supervised audio representation on the creak representation. An integrated plotting feature allows users to visualize and manipulate portions of speech for precise adjustment of creaky phonation levels. Beyond research, CreakVC has potential applications in voice-interactive systems and multimedia production.
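Neither CreaPy's nor FreeVC's interfaces are reproduced here; the sketch below only illustrates the general idea the abstract describes, i.e. appending a per-frame creak level to the decoder input alongside content features and a speaker embedding. All dimensions and module names are invented for illustration.

```python
# Illustrative only (not the authors' code): conditioning a voice-conversion
# decoder on a frame-level creak trajectory, in the spirit of CreakVC.
import torch
import torch.nn as nn

class CreakConditionedDecoder(nn.Module):
    def __init__(self, content_dim: int = 768, spk_dim: int = 256, hidden: int = 512):
        super().__init__()
        # +1 for the scalar creak level appended to every frame.
        self.rnn = nn.GRU(content_dim + spk_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 80)  # e.g. 80-bin mel frames

    def forward(self, content, spk_emb, creak):
        # content: (B, T, content_dim)  SSL features of the source speech
        # spk_emb: (B, spk_dim)         target-speaker embedding
        # creak:   (B, T)               per-frame creak level in [0, 1]
        T = content.size(1)
        spk = spk_emb.unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([content, spk, creak.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h)

# A user-controlled creak slider simply scales or overwrites the creak track:
dec = CreakConditionedDecoder()
mel = dec(torch.randn(1, 100, 768), torch.randn(1, 256), torch.full((1, 100), 0.7))
```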

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
creaky voice, TTS, voice conversion
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-358875 (URN), 2-s2.0-85214828772 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250124

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-24. Bibliographically approved
Abelho Pereira, A. T., Marcinek, L., Miniotaitė, J., Thunberg, S., Lagerstedt, E., Gustafsson, J., . . . Irfan, B. (2024). Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models. In: : . Paper presented at 26th International Conference on Multimodal Interaction (ICMI), San Jose, USA, November 4-8, 2024 (pp. 469-478). Association for Computing Machinery (ACM)
Multimodal User Enjoyment Detection in Human-Robot Conversation: The Power of Large Language Models
2024 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Enjoyment is a crucial yet complex indicator of positive user experience in Human-Robot Interaction (HRI). While manual enjoyment annotation is feasible, developing reliable automatic detection methods remains a challenge. This paper investigates a multimodal approach to automatic enjoyment annotation for HRI conversations, leveraging large language models (LLMs) together with visual, audio, and temporal cues. Our findings demonstrate that both text-only and multimodal LLMs with carefully designed prompts can achieve performance comparable to human annotators in detecting user enjoyment. Furthermore, the results reveal that LLM-based annotations align more strongly with user self-reports of enjoyment than the human annotations do. While multimodal supervised learning techniques did not improve all of our performance metrics, they successfully replicated human annotators and highlighted the importance of visual and audio cues in detecting subtle shifts in enjoyment. This research demonstrates the potential of LLMs for real-time enjoyment detection, paving the way for adaptive companion robots that can dynamically enhance user experiences.
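As an illustration of prompt-based enjoyment annotation (the paper's actual prompts, scales, and models are not reproduced here), the sketch below rates a dialogue excerpt with any text-completion backend passed in as `complete`; the prompt wording and helper names are hypothetical stand-ins.

```python
# Hypothetical sketch of LLM-based enjoyment annotation from a dialogue transcript.
# The prompt, the 1-5 scale, and the `complete` callable are assumptions for
# illustration; the paper's actual setup may differ.
from typing import Callable

ENJOYMENT_PROMPT = """You are annotating a human-robot conversation with an older adult.
Rate the user's enjoyment in the final user turn on a scale of 1 (not enjoying)
to 5 (clearly enjoying). Answer with a single digit.

Dialogue so far:
{dialogue}
"""

def annotate_enjoyment(turns: list[str], complete: Callable[[str], str]) -> int:
    """`complete` is any text-completion backend, e.g. a wrapper around a chat LLM."""
    dialogue = "\n".join(turns)
    reply = complete(ENJOYMENT_PROMPT.format(dialogue=dialogue))
    digits = [c for c in reply if c.isdigit()]
    return int(digits[0]) if digits else 3  # fall back to the neutral midpoint

# Usage with any backend:
# score = annotate_enjoyment(["Robot: How was your week?", "User: Oh, wonderful!"], my_llm)
```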

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
Affect Recognition, Human-Robot Interaction, Large Language Models, Multimodal, Older Adults, User Enjoyment
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-359146 (URN), 10.1145/3678957.3685729 (DOI), 2-s2.0-85212589337 (Scopus ID)
Conference
26th International Conference on Multimodal Interaction (ICMI), San Jose, USA, November 4-8, 2024
Note

QC 20250127

Available from: 2025-01-27 Created: 2025-01-27 Last updated: 2025-02-07. Bibliographically approved
Tånnander, C., Edlund, J. & Gustafsson, J. (2024). Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings: . Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024 (pp. 14111-14121). European Language Resources Association (ELRA)
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
2024 (English) In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 14111-14121. Conference paper, Published paper (Refereed)
Abstract [en]

In order to investigate the strengths and weaknesses of the Audience Response System (ARS) in text-to-speech synthesis (TTS) evaluation, we revisit three previously published TTS studies and perform an ARS-based evaluation on the stimuli used in each study. The experiments are performed with a participant pool of 39 respondents, using a web-based tool that emulates an ARS experiment. The results of the first experiment confirm that ARS is highly useful for evaluating long and continuous stimuli, particularly if we wish for a diagnostic result rather than a single overall metric, while the second and third experiments highlight weaknesses of ARS with unsuitable materials as well as the importance of framing and instruction when conducting ARS-based evaluation.
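One way such a diagnostic result can be derived, sketched here purely as an illustration rather than as the authors' tool, is to bin each respondent's time-stamped button presses over the long stimulus and average across the panel; peaks in the resulting profile point to the stretches of speech that listeners flagged.

```python
# Illustrative aggregation of Audience Response System data: each respondent
# presses a key whenever they notice a problem while listening to a long TTS
# stimulus. Binning the time-stamped presses gives a diagnostic profile over
# time rather than a single overall score. (Sketch only; not the authors' tool.)
import numpy as np

def response_profile(press_times_per_listener, stimulus_dur_s, bin_s=1.0):
    """Return (bin_start_times, mean_presses_per_listener_per_bin)."""
    edges = np.arange(0.0, stimulus_dur_s + bin_s, bin_s)
    counts = np.zeros(len(edges) - 1)
    for presses in press_times_per_listener:
        hist, _ = np.histogram(presses, bins=edges)
        counts += hist
    return edges[:-1], counts / max(len(press_times_per_listener), 1)

starts, profile = response_profile([[2.1, 14.8], [2.4], [15.0, 15.2]], stimulus_dur_s=20)
```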

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
audience response system, evaluation methodology, TTS evaluation
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-348784 (URN), 2-s2.0-85195897862 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024
Note

Part of ISBN 9782493814104

QC 20240701

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-02-07. Bibliographically approved
Lameris, H., Székely, É. & Gustafsson, J. (2024). The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS. In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings: . Paper presented at Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024 (pp. 16058-16065). European Language Resources Association (ELRA)
The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
2024 (English) In: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, European Language Resources Association (ELRA), 2024, p. 16058-16065. Conference paper, Published paper (Refereed)
Abstract [en]

Recent advancements in spontaneous text-to-speech (TTS) have enabled the realistic synthesis of creaky voice, a voice quality known for its diverse pragmatic and paralinguistic functions. In this study, we used synthesized creaky voice in perceptual tests to explore how listeners without formal training perceive two distinct types of creaky voice. We annotated a spontaneous speech corpus using creaky voice detection tools and modified a neural TTS engine with a creaky phonation embedding to control the presence of creaky phonation in the synthesized speech. An objective analysis using a creak detection tool revealed significant differences in creaky phonation levels between the two creaky voice types and modal voice. Two subjective listening experiments were performed to investigate the effect of creaky voice on perceived certainty, valence, sarcasm, and turn finality. Participants rated non-positional creak as less certain, less positive, and more indicative of turn finality, while positional creak was rated significantly more turn-final than modal phonation.

Place, publisher, year, edition, pages
European Language Resources Association (ELRA), 2024
Keywords
creaky voice, speech perception, speech synthesis, voice quality
National Category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-348782 (URN), 2-s2.0-85195915140 (Scopus ID)
Conference
Joint 30th International Conference on Computational Linguistics and 14th International Conference on Language Resources and Evaluation, LREC-COLING 2024, Hybrid, Torino, Italy, May 20 2024 - May 25 2024
Note

QC 20240701

Part of ISBN 978-249381410-4

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2024-07-01. Bibliographically approved
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS. In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings. Paper presented at 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023. Institute of Electrical and Electronics Engineers (IEEE)
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
2023 (English) In: ICASSPW 2023: 2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims to address these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining a constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms the other tested SSLs and mel-spectrograms, in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems, and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts
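For readers who want to inspect the kind of representation discussed here, the snippet below extracts the 9th transformer layer from a 12-layer, ASR-finetuned wav2vec2.0 checkpoint via Hugging Face Transformers; whether this exact checkpoint matches the one used in the paper is an assumption.

```python
# Extracting an intermediate wav2vec 2.0 layer as a TTS representation target.
# "facebook/wav2vec2-base-960h" is a standard 12-layer, ASR-finetuned checkpoint;
# it is assumed here only for illustration.
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

waveform = torch.randn(1, 16000)  # 1 s of 16 kHz audio (placeholder)
with torch.no_grad():
    out = model(waveform, output_hidden_states=True)

# hidden_states[0] is the convolutional/embedding output; hidden_states[9] is the
# output of the 9th transformer layer, the layer reported to work best here.
layer9 = out.hidden_states[9]  # shape: (batch, frames, 768)
print(layer9.shape)
```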

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
self-supervised speech representation, speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-335090 (URN), 10.1109/ICASSPW59220.2023.10193157 (DOI), 001046933700056 (), 2-s2.0-85165623363 (Scopus ID)
Conference
2023 IEEE International Conference on Acoustics, Speech and Signal Processing Workshops, ICASSPW 2023, Rhodes Island, Greece, Jun 4 2023 - Jun 10 2023
Note

Part of ISBN 9798350302615

QC 20230831

Available from: 2023-08-31 Created: 2023-08-31 Last updated: 2025-02-07. Bibliographically approved
Wang, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). A comparative study of self-supervised speech representations in read and spontaneous TTS. Paper presented at 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece.
A comparative study of self-supervised speech representations in read and spontaneous TTS
2023 (English) Manuscript (preprint) (Other academic)
Abstract [en]

Recent work has explored using self-supervised learning (SSL) speech representations such as wav2vec2.0 as the representation medium in standard two-stage TTS, in place of conventionally used mel-spectrograms. It is however unclear which speech SSL is the better fit for TTS, and whether or not the performance differs between read and spontaneous TTS, the latter of which is arguably more challenging. This study aims to address these questions by testing several speech SSLs, including different layers of the same SSL, in two-stage TTS on both read and spontaneous corpora, while maintaining a constant TTS model architecture and training settings. Results from listening tests show that the 9th layer of 12-layer wav2vec2.0 (ASR finetuned) outperforms the other tested SSLs and mel-spectrograms, in both read and spontaneous TTS. Our work sheds light both on how speech SSL can readily improve current TTS systems, and on how SSLs compare in the challenging generative task of TTS. Audio examples can be found at https://www.speech.kth.se/tts-demos/ssr_tts

Keywords
speech synthesis, self-supervised speech representation, spontaneous speech
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering; Other Engineering and Technologies
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-328741 (URN), 979-8-3503-0261-5 (ISBN)
Conference
2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece
Projects
Digital Futures project Advanced Adaptive Intelligent Systems (AAIS); Swedish Research Council project Connected (VR-2019-05003); Swedish Research Council project Perception of speaker stance (VR-2020-02396); Riksbankens Jubileumsfond project CAPTivating (P20-0298); Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation
Note

Accepted by the 2023 IEEE International Conference on Acoustics, Speech, and Signal Processing Workshops, 4-10 Jun 2023, Rhodes Island, Greece

QC 20230620

Available from: 2023-06-12 Created: 2023-06-12 Last updated: 2025-02-18. Bibliographically approved
Peña, P. R., Doyle, P. R., Ip, E. Y., Di Liberto, G., Higgins, D., McDonnell, R., . . . Cowan, B. R. (2023). A Special Interest Group on Developing Theories of Language Use in Interaction with Conversational User Interfaces. In: CHI 2023: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems. Paper presented at 2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, Hamburg, Germany, Apr 23 2023 - Apr 28 2023. Association for Computing Machinery (ACM), Article ID 509.
A Special Interest Group on Developing Theories of Language Use in Interaction with Conversational User Interfaces
2023 (English) In: CHI 2023: Extended Abstracts of the 2023 CHI Conference on Human Factors in Computing Systems, Association for Computing Machinery (ACM), 2023, article id 509. Conference paper, Published paper (Refereed)
Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
conversational user interfaces, human-machine dialogue, psycholinguistic models, speech agents
National Category
Human Computer Interaction; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-333352 (URN), 10.1145/3544549.3583179 (DOI), 2-s2.0-85153273115 (Scopus ID)
Conference
2023 CHI Conference on Human Factors in Computing Systems, CHI 2023, Hamburg, Germany, Apr 23 2023 - Apr 28 2023
Note

Part of ISBN 9781450394222

QC 20230801

Available from: 2023-08-01 Created: 2023-08-01 Last updated: 2025-02-01. Bibliographically approved
Ekstedt, E., Wang, S., Székely, É., Gustafsson, J. & Skantze, G. (2023). Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023: . Paper presented at the 24th Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 5481-5485). International Speech Communication Association
Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis
2023 (English) In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023, International Speech Communication Association, 2023, p. 5481-5485. Conference paper, Published paper (Refereed)
Abstract [en]

Turn-taking is a fundamental aspect of human communication, where speakers convey their intention to either hold or yield their turn through prosodic cues. Using the recently proposed Voice Activity Projection model, we propose an automatic evaluation approach to measure these aspects for conversational speech synthesis. We investigate the ability of three commercial and two open-source Text-To-Speech (TTS) systems to generate turn-taking cues over simulated turns. By varying the stimuli, or controlling the prosody, we analyze the models' performance. We show that while commercial TTS systems largely provide appropriate cues, they often produce ambiguous signals, and that further improvements are possible. TTS trained on read or spontaneous speech produces strong turn-hold but weak turn-yield cues. We argue that this approach, which focuses on functional aspects of interaction, provides a useful addition to other important speech metrics, such as intelligibility and naturalness.
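The released Voice Activity Projection code is not reproduced here; the sketch below only captures the evaluation logic described in the abstract, with `tts_synthesize` and `vap_shift_probability` as hypothetical stand-ins for a TTS system and a trained VAP-style model.

```python
# Sketch of the evaluation logic (not the released VAP code): synthesize a turn
# that should signal "hold" and one that should signal "yield", then compare the
# turn-shift probability a VAP-style model assigns near the end of each stimulus.
# `tts_synthesize` and `vap_shift_probability` are hypothetical stand-ins.

def turn_taking_cue_score(tts_synthesize, vap_shift_probability) -> float:
    hold_text = "I went to the store and bought some"       # incomplete -> hold
    yield_text = "I went to the store and bought some milk."  # complete -> yield
    p_shift_hold = vap_shift_probability(tts_synthesize(hold_text))
    p_shift_yield = vap_shift_probability(tts_synthesize(yield_text))
    # A TTS with clear prosodic turn-taking cues should separate the two cases.
    return float(p_shift_yield - p_shift_hold)
```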

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
human-computer interaction, text-to-speech, turn-taking
National Category
Natural Language Processing; Computer Sciences; General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-337873 (URN), 10.21437/Interspeech.2023-2064 (DOI), 001186650305133 (), 2-s2.0-85171597862 (Scopus ID)
Conference
24th Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Projects
tmh_turntaking
Note

QC 20241024

Available from: 2023-10-10 Created: 2023-10-10 Last updated: 2025-02-01. Bibliographically approved