Publications (10 of 12)
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1-5, 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
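
The conditioning idea lends itself to a small sketch. The Python below is a hypothetical illustration, not the authors' implementation: the feature axes and positions are invented stand-ins for the 11-feature US English set described above, showing how each phoneme becomes a vector of continuous values in [0.0, 1.0] that conditions the TTS instead of a one-hot phoneme identity.

# Hypothetical feature axes and positions, for illustration only.
PHONEME_FEATURES = {
    "s": {"voicing": 0.0, "frication": 1.0, "nasality": 0.0},
    "z": {"voicing": 1.0, "frication": 1.0, "nasality": 0.0},
    "n": {"voicing": 1.0, "frication": 0.0, "nasality": 1.0},
}
AXES = ("voicing", "frication", "nasality")

def features_for(phonemes):
    # Turn a phoneme sequence into per-symbol continuous feature vectors.
    return [[PHONEME_FEATURES[p][axis] for axis in AXES] for p in phonemes]

# Intermediate values probe the model between categories, e.g. voicing = 0.5
# sits between the canonical /s/ and /z/ positions on that axis.
print(features_for(["s", "n", "z"]))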

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National subject category
Language Processing and Computational Linguistics; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN), 10.21437/Interspeech.2024-1565 (DOI), 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-01-28. Bibliographically approved.
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, Published paper (Refereed)
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Project
bodytalk
Note

QC 20241022

Available from: 2024-10-22. Created: 2024-10-22. Last updated: 2024-10-22. Bibliographically approved.
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). MATCHA-TTS: A FAST TTS ARCHITECTURE WITH CONDITIONAL FLOW MATCHING. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Paper presented at the 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14-19, 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
MATCHA-TTS: A FAST TTS ARCHITECTURE WITH CONDITIONAL FLOW MATCHING
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
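
As a rough sketch of the optimal-transport conditional flow matching objective the model is trained with (the interface of the velocity network v_theta and the value of sigma_min are illustrative assumptions, and conditioning on text is omitted):

import torch

def ot_cfm_loss(v_theta, x1, sigma_min=1e-4):
    # x1: a batch of target acoustic frames (e.g. mel-spectrograms); x0: Gaussian noise.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)), device=x1.device)
    # Point on the straight (optimal-transport) path from noise to data at time t.
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    # The target velocity along that path does not depend on t.
    u = x1 - (1 - sigma_min) * x0
    # Regress the network's predicted velocity onto the target velocity.
    return torch.mean((v_theta(x_t, t) - u) ** 2)

At synthesis time the learned velocity field defines an ODE that is integrated from noise to an acoustic frame sequence, which is why few solver steps suffice.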

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, diffusion models, flow matching, speech synthesis, text-to-speech
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-350551 (URN), 10.1109/ICASSP48485.2024.10448291 (DOI), 001396233804117 (), 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14-19, 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16. Created: 2024-07-16. Last updated: 2025-03-26. Bibliographically approved.
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024. Paper presented at the 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1-5, 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
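
To make the contrast concrete, here is a minimal sketch (interfaces are assumed, not taken from the paper) of sampling log-durations from an OT-CFM duration model by integrating its learned velocity field from noise with a few Euler steps, so repeated calls give different but plausible timings:

import torch

def sample_log_durations(v_theta, cond, n_steps=10):
    # cond: per-symbol conditioning vectors, shape (num_symbols, dim).
    x = torch.randn(cond.shape[0], 1)         # start from noise, one value per symbol
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((cond.shape[0], 1), i * dt)
        x = x + dt * v_theta(x, t, cond)      # Euler step along the learned flow
    return x                                  # stochastic log-durations

# A deterministic regression-based duration predictor would instead return its
# conditional mean, giving identical timings for the same text every time.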

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, speech synthesis, spontaneous speech
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-358878 (URN), 10.21437/Interspeech.2024-1582 (DOI), 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23. Created: 2025-01-23. Last updated: 2025-02-25. Bibliographically approved.
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024). Paper presented at the 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Unified speech and gesture synthesis using flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 8220-8224. Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
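
One way to picture the unified design (the dimensionalities and the split below are illustrative assumptions, not the paper's exact feature layout): a single OT-CFM decoder emits one frame-level vector covering both modalities, which is only separated into acoustics and pose after generation, so the cross-modal dependencies are modelled jointly.

import torch

def split_joint_output(joint_frames, n_mel=80):
    # joint_frames: (batch, T, n_mel + n_pose) frames from one generation pass.
    mel = joint_frames[..., :n_mel]       # acoustic features, e.g. for a vocoder
    motion = joint_frames[..., n_mel:]    # skeleton pose features for rendering
    return mel, motion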

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics, Speech and Signal Processing (ICASSP), ISSN 1520-6149
Keywords
text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National subject category
Comparative Language Studies and General Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN), 10.1109/ICASSP48485.2024.10445998 (DOI), 001396233801103 (), 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02. Created: 2025-04-02. Last updated: 2025-04-09. Bibliographically approved.
Deichler, A., Mehta, S., Alexanderson, S. & Beskow, J. (2023). Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023. Paper presented at the 25th International Conference on Multimodal Interaction (ICMI), Oct 9-13, 2023, Sorbonne Univ, Paris, France (pp. 755-762). Association for Computing Machinery (ACM)
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
2023 (English). In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023, Association for Computing Machinery (ACM), 2023, pp. 755-762. Conference paper, Published paper (Refereed)
Abstract [en]

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim to learn a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech appropriateness rating among the submitted entries. This indicates that our system is a promising approach for achieving human-like co-speech gestures in agents that carry semantic meaning.
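
A sketch of the kind of symmetric contrastive objective the CSMP module suggests (a CLIP-style reading; the paper's exact training details may differ): time-aligned speech and motion windows from the same clip are positives, all other pairings in the batch are negatives.

import torch
import torch.nn.functional as F

def contrastive_speech_motion_loss(speech_emb, motion_emb, temperature=0.07):
    # speech_emb, motion_emb: (batch, dim) embeddings of time-aligned windows.
    s = F.normalize(speech_emb, dim=-1)
    m = F.normalize(motion_emb, dim=-1)
    logits = s @ m.t() / temperature                  # pairwise similarities
    targets = torch.arange(s.shape[0], device=s.device)
    # Symmetric cross-entropy: speech retrieves its motion and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))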

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
gesture generation, motion synthesis, diffusion models, contrastive pre-training, semantic gestures
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-343773 (URN), 10.1145/3577190.3616117 (DOI), 001147764700093 (), 2-s2.0-85170496681 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction (ICMI), Oct 9-13, 2023, Sorbonne Univ, Paris, France
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22. Created: 2024-02-22. Last updated: 2025-02-07. Bibliographically approved.
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at the 24th Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
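
The exact-likelihood idea can be sketched as follows (the interfaces of flow and neural_hmm_log_prob are assumptions for illustration): an invertible post-net maps the acoustics into a space the neural HMM models well, and the change-of-variables term keeps the overall likelihood exact.

import torch

def overflow_log_likelihood(neural_hmm_log_prob, flow, mel):
    # flow is invertible and returns (z, log_det_jacobian) for the acoustic frames.
    z, log_det = flow(mel)
    # Exact log-likelihood to maximise: the HMM's likelihood of the transformed
    # frames plus the Jacobian correction from the change of variables.
    return neural_hmm_log_prob(z) + log_det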

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, probabilistic TTS
National subject category
Language Processing and Computational Linguistics
Identifiers
urn:nbn:se:kth:diva-338584 (URN), 10.21437/Interspeech.2023-1996 (DOI), 001186650304087 (), 2-s2.0-85167953412 (Scopus ID)
Conference
24th Conference of the International Speech Communication Association, Interspeech 2023, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-11-07. Created: 2023-11-07. Last updated: 2025-02-07. Bibliographically approved.
Lameris, H., Mehta, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). Prosody-Controllable Spontaneous TTS with Neural HMMs. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Paper presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE)
Prosody-Controllable Spontaneous TTS with Neural HMMs
2023 (English). In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech makes the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system's capability of synthesizing two types of creaky voice.
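
A minimal sketch of utterance-level prosody conditioning (the choice of control features and where they are injected are illustrative assumptions, not the paper's exact design): scalar prosody targets are broadcast over the symbol sequence and appended to the encoder outputs of the neural HMM TTS.

import torch

def add_utterance_prosody(encoder_outputs, pitch_mean, speech_rate):
    # encoder_outputs: (batch, num_symbols, dim); pitch_mean, speech_rate: (batch,)
    b, n, _ = encoder_outputs.shape
    ctrl = torch.stack([pitch_mean, speech_rate], dim=-1)  # (batch, 2)
    ctrl = ctrl.unsqueeze(1).expand(b, n, 2)               # repeat over all symbols
    return torch.cat([encoder_outputs, ctrl], dim=-1)      # prosody-conditioned states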

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
speech synthesis, prosodic control, neural HMM, spontaneous speech, creaky voice
National subject category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-327893 (URN), 10.1109/ICASSP49357.2023.10097200 (DOI), 2-s2.0-85174033357 (Scopus ID)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Funder
Vetenskapsrådet, VR-2019-05003; Vetenskapsrådet, VR-2020-02396; Riksbankens Jubileumsfond, P20-0298; Knut och Alice Wallenbergs Stiftelse, WASP
Note

QC 20230602

Available from: 2023-06-01. Created: 2023-06-01. Last updated: 2024-08-28. Bibliographically approved.
Cumbal, R., Axelsson, A., Mehta, S. & Engwall, O. (2023). Stereotypical nationality representations in HRI: perspectives from international young adults. Frontiers in Robotics and AI, 10, Article ID 1264614.
Stereotypical nationality representations in HRI: perspectives from international young adults
2023 (English). In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1264614. Article in journal (Refereed). Published
Abstract [en]

People often form immediate expectations about other people, or groups of people, based on visual appearance and characteristics of their voice and speech. These stereotypes, often inaccurate or overgeneralized, may translate to robots that carry human-like qualities. This study aims to explore whether nationality-based preconceptions regarding appearance and accents can be found in people's perception of a virtual and a physical social robot. In an online survey with 80 subjects evaluating different first-language-influenced accents of English and nationality-influenced human-like faces for a virtual robot, we find that accents, in particular, lead to preconceptions on perceived competence and likeability that correspond to previous findings in social science research. In a physical interaction study with 74 participants, we then studied whether the perception of competence and likeability is similar after interacting with a robot portraying one of four different nationality representations from the online survey. We find that preconceptions on national stereotypes that appeared in the online survey vanish or are overshadowed by factors related to general interaction quality. We do, however, find some effects of the robot's stereotypical alignment with the subject group, with Swedish subjects (the majority group in this study) rating the Swedish-accented robot as less competent than the international group, but, on the other hand, recalling more facts from the Swedish robot's presentation than the international group does. In an extension in which the physical robot was replaced by a virtual robot interacting in the same scenario online, we further found the same result: preconceptions are of less importance after actual interactions, demonstrating that the differences in the ratings of the robot between the online survey and the interaction are not due to the interaction medium. We hence conclude that attitudes towards stereotypical national representations in HRI have a weak effect, at least for the user group included in this study (primarily educated young students in an international setting).

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
accent, appearance, social robot, nationality, stereotype, impression, competence, likeability
National subject category
Human-Computer Interaction (Interaction Design)
Identifiers
urn:nbn:se:kth:diva-341526 (URN), 10.3389/frobt.2023.1264614 (DOI), 001115613500001 (), 38077460 (PubMedID), 2-s2.0-85178920101 (Scopus ID)
Note

QC 20231222

Available from: 2023-12-22. Created: 2023-12-22. Last updated: 2024-02-26. Bibliographically approved.
Mehta, S., Székely, É., Beskow, J. & Henter, G. E. (2022). Neural HMMs are all you need (for high-quality attention-free TTS). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at the 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 23-27, 2022, Singapore, Singapore (pp. 7457-7461). IEEE Signal Processing Society
Neural HMMs are all you need (for high-quality attention-free TTS)
2022 (English). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2022, pp. 7457-7461. Conference paper, Published paper (Refereed)
Abstract [en]

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
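
A sketch of the exact sequence-likelihood computation behind the alignment mechanism: the standard forward recursion for a left-right no-skip HMM, with the autoregressive, network-predicted parameters abstracted into per-frame inputs (the tensor layout is an illustrative assumption).

import torch

def left_right_hmm_log_likelihood(log_emission, log_stay, log_move):
    # log_emission: (T, N) frame log-probabilities under each of N states;
    # log_stay, log_move: (T, N) log-probabilities of staying in a state or
    # advancing by exactly one state (no skips) at each frame.
    T, N = log_emission.shape
    alpha = torch.full((N,), float("-inf"))
    alpha[0] = log_emission[0, 0]                 # must start in the first state
    neg_inf = torch.full((1,), float("-inf"))
    for t in range(1, T):
        stay = alpha + log_stay[t]
        move = torch.cat([neg_inf, alpha[:-1] + log_move[t, :-1]])
        alpha = torch.logaddexp(stay, move) + log_emission[t]
    return alpha[-1]                              # must end in the last state

Because the alignment is left-right and cannot skip states, synthesis is monotonic by construction, which removes the attention failure modes mentioned above.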

Place, publisher, year, edition, pages
IEEE Signal Processing Society, 2022
Series
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ISSN 2379-190X
Keywords
seq2seq, attention, HMMs, duration modelling, acoustic modelling
National subject category
Language Processing and Computational Linguistics; Probability Theory and Statistics; Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-312455 (URN), 10.1109/ICASSP43922.2022.9746686 (DOI), 000864187907152 (), 2-s2.0-85131260082 (Scopus ID)
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 23-27, 2022, Singapore, Singapore
Funder
Knut och Alice Wallenbergs Stiftelse, WASP
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220601

Available from: 2022-05-18. Created: 2022-05-18. Last updated: 2025-02-01. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-1886-681X
