Publications (10 of 12)
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
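To make the conditioning scheme concrete, here is a minimal sketch of mapping phonemes to continuous phonological feature vectors in [0.0, 1.0] and using them in place of categorical symbol embeddings. The feature names, values, and interpolation helper are illustrative assumptions, not the paper's actual 11-feature inventory for US English.

```python
# Illustrative sketch (not the paper's exact feature set or values): each phoneme
# is mapped to a vector of continuous phonological features in [0.0, 1.0], which
# replaces the usual categorical phoneme embedding as conditioning input to the TTS model.
import numpy as np

FEATURES = ["voicing", "nasality", "frication", "height", "backness"]  # hypothetical subset

PHONEME_FEATURES = {            # hypothetical positions on each feature axis
    "s": [0.0, 0.0, 1.0, 0.5, 0.3],
    "z": [1.0, 0.0, 1.0, 0.5, 0.3],
    "i": [1.0, 0.0, 0.0, 1.0, 0.0],
    "a": [1.0, 0.0, 0.0, 0.0, 0.5],
}

def encode(phonemes, interpolate=None):
    """Return an (N, F) conditioning matrix; optionally move one feature
    to an intermediate value to probe categorical perception (e.g. /s/ -> /z/)."""
    X = np.array([PHONEME_FEATURES[p] for p in phonemes], dtype=np.float32)
    if interpolate is not None:
        idx, feat, value = interpolate
        X[idx, FEATURES.index(feat)] = value   # continuous, not categorical
    return X

print(encode(["s", "i"], interpolate=(0, "voicing", 0.5)))  # halfway between /s/ and /z/
```

Setting a feature to an intermediate value, as in the example call, is the kind of manipulation probed in the categorical perception experiment described above.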

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
HSV category
Identifiers
urn:nbn:se:kth:diva-358877 (URN), 10.21437/Interspeech.2024-1565 (DOI), 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-01-28 Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English) In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, Published paper (Refereed)
HSV category
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22 Bibliographically approved
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
Matcha-TTS: A fast TTS architecture with conditional flow matching
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
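As a rough illustration of the training objective, the sketch below shows optimal-transport conditional flow matching (OT-CFM) on toy data. The tiny unconditional MLP and the random "acoustic frames" are placeholders, not Matcha-TTS's text-conditioned decoder or its implementation details.

```python
# Minimal sketch of OT-CFM training, the kind of objective used for the decoder.
# A toy vector field regresses towards the constant velocity of a straight
# noise-to-data path; at synthesis time an ODE solver integrates this field.
import torch
import torch.nn as nn

sigma_min = 1e-4
net = nn.Sequential(nn.Linear(80 + 1, 256), nn.SiLU(), nn.Linear(256, 80))  # toy vector field

def ot_cfm_loss(x1):                      # x1: target acoustic frames, shape (B, 80)
    x0 = torch.randn_like(x1)             # noise sample
    t = torch.rand(x1.size(0), 1)         # uniform flow time
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1     # OT interpolation path
    ut = x1 - (1 - sigma_min) * x0                   # target (constant) velocity
    vt = net(torch.cat([xt, t], dim=-1))             # predicted velocity at (xt, t)
    return ((vt - ut) ** 2).mean()

loss = ot_cfm_loss(torch.randn(16, 80))
loss.backward()
```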

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
HSV category
Identifiers
urn:nbn:se:kth:diva-350551 (URN), 10.1109/ICASSP48485.2024.10448291 (DOI), 001396233804117 (), 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-03-26 Bibliographically approved
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
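The contrast under study can be sketched as follows: a deterministic duration head regresses one log-duration per symbol (identical timings at every synthesis), whereas a probabilistic head defines a distribution that is sampled anew each time. The Gaussian sampling head below is only a stand-in for the paper's OT-CFM-based duration model, and the module sizes are arbitrary.

```python
# Conceptual sketch: deterministic vs. stochastic duration prediction per input symbol.
import torch
import torch.nn as nn

class DeterministicDurations(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.head = nn.Linear(d, 1)
    def forward(self, h):                      # h: (N, d) encoder states per symbol
        return self.head(h).exp()              # same durations on every call

class StochasticDurations(nn.Module):          # Gaussian stand-in for an OT-CFM model
    def __init__(self, d):
        super().__init__()
        self.head = nn.Linear(d, 2)
    def forward(self, h):
        mu, log_sigma = self.head(h).chunk(2, dim=-1)
        return (mu + log_sigma.exp() * torch.randn_like(mu)).exp()  # varies per call

h = torch.randn(5, 64)
print(DeterministicDurations(64)(h).squeeze())
print(StochasticDurations(64)(h).squeeze())    # different samples on each run
```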

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
HSV category
Identifiers
urn:nbn:se:kth:diva-358878 (URN), 10.21437/Interspeech.2024-1582 (DOI), 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-02-25 Bibliographically approved
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024). Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Unified speech and gesture synthesis using flow matching
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 8220-8224. Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
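A minimal sketch of the joint-modelling idea, under the assumption of time-aligned acoustic and pose features with illustrative dimensions: the two streams are stacked into one frame vector so that a single flow-matching decoder can model their joint distribution and emit both modalities in one process.

```python
# Sketch of the "one single process" idea: speech (mel frames) and gesture
# (skeleton pose features) are concatenated into one joint frame vector.
# Dimensions are illustrative, not taken from the paper.
import torch

N_MEL, N_POSE = 80, 45                      # hypothetical feature sizes
mel = torch.randn(200, N_MEL)               # T acoustic frames
pose = torch.randn(200, N_POSE)             # time-aligned gesture features
joint = torch.cat([mel, pose], dim=-1)      # (T, 125): one target for one decoder

# After sampling from the joint decoder, the two modalities are simply split apart:
mel_out, pose_out = joint.split([N_MEL, N_POSE], dim=-1)
```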

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
HSV category
Identifiers
urn:nbn:se:kth:diva-361616 (URN), 10.1109/ICASSP48485.2024.10445998 (DOI), 001396233801103 (), 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02 Created: 2025-04-02 Last updated: 2025-04-09 Bibliographically approved
Deichler, A., Mehta, S., Alexanderson, S. & Beskow, J. (2023). Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation. In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023. Paper presented at 25th International Conference on Multimodal Interaction (ICMI), Oct 09-13, 2023, Sorbonne Univ, Paris, France (pp. 755-762). Association for Computing Machinery (ACM)
Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
2023 (English) In: Proceedings of the 25th International Conference on Multimodal Interaction, ICMI 2023, Association for Computing Machinery (ACM), 2023, pp. 755-762. Conference paper, Published paper (Refereed)
Abstract [en]

This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim of capturing a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically aware co-speech gesture generation. Our entry achieved the highest human-likeness and the highest speech-appropriateness ratings among the submitted entries. This indicates that our system is a promising approach for achieving human-like co-speech gestures that carry semantic meaning in agents.
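The CSMP idea can be illustrated with a CLIP-style symmetric contrastive loss over paired speech and motion windows: matched pairs are pulled together in a joint embedding space, mismatched pairs pushed apart. The linear encoders and feature sizes below are placeholders, not the system's actual architecture.

```python
# Sketch of a symmetric contrastive objective over paired speech and motion features.
import torch
import torch.nn as nn
import torch.nn.functional as F

speech_enc = nn.Linear(128, 64)   # stand-ins for the real speech and motion encoders
motion_enc = nn.Linear(165, 64)

def csmp_loss(speech_feats, motion_feats, temperature=0.07):
    s = F.normalize(speech_enc(speech_feats), dim=-1)        # (B, 64)
    m = F.normalize(motion_enc(motion_feats), dim=-1)        # (B, 64)
    logits = s @ m.t() / temperature                         # pairwise similarities
    targets = torch.arange(s.size(0))                        # i-th speech matches i-th motion
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = csmp_loss(torch.randn(8, 128), torch.randn(8, 165))
```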

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2023
Keywords
gesture generation, motion synthesis, diffusion models, contrastive pre-training, semantic gestures
HSV category
Identifiers
urn:nbn:se:kth:diva-343773 (URN), 10.1145/3577190.3616117 (DOI), 001147764700093 (), 2-s2.0-85170496681 (Scopus ID)
Conference
25th International Conference on Multimodal Interaction (ICMI), Oct 09-13, 2023, Sorbonne Univ, Paris, France
Note

Part of proceedings ISBN 979-8-4007-0055-2

QC 20240222

Available from: 2024-02-22 Created: 2024-02-22 Last updated: 2025-02-07 Bibliographically approved
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É. & Henter, G. E. (2023). OverFlow: Putting flows on top of neural transducers for better TTS. In: Interspeech 2023. Paper presented at the 24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland (pp. 4279-4283). International Speech Communication Association
OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English) In: Interspeech 2023, International Speech Communication Association, 2023, pp. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
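A toy sketch of the exact-likelihood construction: an invertible post-net maps acoustics to latents that the neural HMM scores, and the change-of-variables term is added to the HMM log-likelihood. The elementwise scaling and stand-in emission model below are illustrative assumptions only; the actual post-net is a Glow-style normalising flow.

```python
# Sketch of the change-of-variables likelihood behind an invertible post-net:
# log p(x) = log p_HMM(f(x)) + log |det df/dx|, here with a toy elementwise scaling f.
import torch

log_scale = torch.zeros(80, requires_grad=True)      # toy invertible transform: z = x * exp(s)

def exact_log_likelihood(x, hmm_log_prob):
    z = x * log_scale.exp()
    log_det = log_scale.sum() * x.size(0)            # Jacobian of the elementwise scaling
    return hmm_log_prob(z) + log_det

# hmm_log_prob would be the neural HMM's forward-algorithm likelihood over frames z;
# a standard-normal emission model stands in for it here.
dummy_hmm = lambda z: -0.5 * (z ** 2).sum()
ll = exact_log_likelihood(torch.randn(100, 80), dummy_hmm)
```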

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, Probabilistic TTS
HSV category
Identifiers
urn:nbn:se:kth:diva-338584 (URN), 10.21437/Interspeech.2023-1996 (DOI), 001186650304087 (), 2-s2.0-85167953412 (Scopus ID)
Conference
24th Annual Conference of the International Speech Communication Association (Interspeech 2023), August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-11-07 Created: 2023-11-07 Last updated: 2025-02-07 Bibliographically approved
Lameris, H., Mehta, S., Henter, G. E., Gustafsson, J. & Székely, É. (2023). Prosody-Controllable Spontaneous TTS with Neural HMMs. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Paper presented at International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Institute of Electrical and Electronics Engineers (IEEE)
Prosody-Controllable Spontaneous TTS with Neural HMMs
2023 (English) In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Spontaneous speech has many affective and pragmatic functions that are interesting and challenging to model in TTS. However, the presence of reduced articulation, fillers, repetitions, and other disfluencies in spontaneous speech make the text and acoustics less aligned than in read speech, which is problematic for attention-based TTS. We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets, while also reproducing the diversity of expressive phenomena present in spontaneous speech. Specifically, we add utterance-level prosody control to an existing neural HMM-based TTS system which is capable of stable, monotonic alignments for spontaneous speech. We objectively evaluate control accuracy and perform perceptual tests that demonstrate that prosody control does not degrade synthesis quality. To exemplify the power of combining prosody control and ecologically valid data for reproducing intricate spontaneous speech phenomena, we evaluate the system’s capability of synthesizing two types of creaky voice.
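One simple way to realise utterance-level prosody control, sketched here under assumed feature names and dimensions (not the paper's exact design), is to broadcast normalised prosody values onto every encoder state before decoding.

```python
# Sketch of utterance-level prosody conditioning: scalar prosody controls (e.g. mean
# pitch and speaking rate, normalised) are broadcast and concatenated onto each
# encoder state. Feature names and sizes are illustrative only.
import torch

encoder_states = torch.randn(42, 256)                       # one vector per input symbol
prosody = torch.tensor([0.2, -1.0])                         # e.g. slightly high pitch, fast rate
conditioned = torch.cat(
    [encoder_states, prosody.expand(encoder_states.size(0), -1)], dim=-1
)                                                           # (42, 258) decoder conditioning
```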

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
Keywords
Speech Synthesis, Prosodic Control, Neural HMM, Spontaneous speech, Creaky voice
HSV category
Identifiers
urn:nbn:se:kth:diva-327893 (URN), 10.1109/ICASSP49357.2023.10097200 (DOI), 2-s2.0-85174033357 (Scopus ID)
Conference
International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Funder
Swedish Research Council, VR-2019-05003; Swedish Research Council, VR-2020-02396; Riksbankens Jubileumsfond, P20-0298; Knut and Alice Wallenberg Foundation, WASP
Note

QC 20230602

Available from: 2023-06-01 Created: 2023-06-01 Last updated: 2024-08-28 Bibliographically approved
Cumbal, R., Axelsson, A., Mehta, S. & Engwall, O. (2023). Stereotypical nationality representations in HRI: perspectives from international young adults. Frontiers in Robotics and AI, 10, Article ID 1264614.
Stereotypical nationality representations in HRI: perspectives from international young adults
2023 (English) In: Frontiers in Robotics and AI, E-ISSN 2296-9144, Vol. 10, article id 1264614. Article in journal (Refereed) Published
Abstract [en]

People often form immediate expectations about other people, or groups of people, based on visual appearance and characteristics of their voice and speech. These stereotypes, often inaccurate or overgeneralized, may carry over to robots with human-like qualities. This study explores whether nationality-based preconceptions regarding appearance and accents can be found in people's perception of a virtual and a physical social robot. In an online survey with 80 subjects evaluating different first-language-influenced accents of English and nationality-influenced human-like faces for a virtual robot, we find that accents, in particular, lead to preconceptions about perceived competence and likeability that correspond to previous findings in social science research. In a physical interaction study with 74 participants, we then examined whether the perception of competence and likeability is similar after interacting with a robot portraying one of four nationality representations from the online survey. We find that the preconceptions about national stereotypes that appeared in the online survey vanish or are overshadowed by factors related to general interaction quality. We do, however, find some effects of the robot's stereotypical alignment with the subject group: Swedish subjects (the majority group in this study) rated the Swedish-accented robot as less competent than the international group did but, on the other hand, recalled more facts from the Swedish robot's presentation than the international group. In an extension in which the physical robot was replaced by a virtual robot interacting in the same scenario online, we found the same result, namely that preconceptions matter less after actual interaction, demonstrating that the differences in robot ratings between the online survey and the interaction study are not due to the interaction medium. We conclude that attitudes towards stereotypical national representations in HRI have a weak effect, at least for the user group included in this study (primarily educated young students in an international setting).

Place, publisher, year, edition, pages
Frontiers Media SA, 2023
Keywords
accent, appearance, social robot, nationality, stereotype, impression, competence, likeability
HSV category
Identifiers
urn:nbn:se:kth:diva-341526 (URN), 10.3389/frobt.2023.1264614 (DOI), 001115613500001 (), 38077460 (PubMedID), 2-s2.0-85178920101 (Scopus ID)
Note

QC 20231222

Available from: 2023-12-22 Created: 2023-12-22 Last updated: 2024-02-26 Bibliographically approved
Mehta, S., Székely, É., Beskow, J. & Henter, G. E. (2022). Neural HMMs are all you need (for high-quality attention-free TTS). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 23-27, 2022, Singapore, Singapore (pp. 7457-7461). IEEE Signal Processing Society
Neural HMMs are all you need (for high-quality attention-free TTS)
2022 (English) In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2022, pp. 7457-7461. Conference paper, Published paper (Refereed)
Abstract [en]

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
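The key quantity is the exact sequence likelihood of a left-right, no-skip HMM, computed with the forward algorithm in log space. In the paper the emission and transition probabilities come from an autoregressive neural network; the sketch below uses random placeholders just to show the recursion.

```python
# Sketch of the forward algorithm for a left-right, no-skip HMM in log space.
# Emission and transition terms are random stand-ins for the network's outputs.
import torch

T, N = 50, 12                                # frames, states
log_emit = torch.randn(T, N)                 # log p(frame_t | state_n), network output in reality
log_stay = torch.log(torch.full((N,), 0.5))  # log prob of staying in the same state
log_move = torch.log(torch.full((N,), 0.5))  # log prob of advancing to the next state

log_alpha = torch.full((N,), float("-inf"))
log_alpha[0] = log_emit[0, 0]                # must start in the first state
for t in range(1, T):
    stay = log_alpha + log_stay
    move = torch.cat([torch.tensor([float("-inf")]), log_alpha[:-1] + log_move[:-1]])
    log_alpha = torch.logaddexp(stay, move) + log_emit[t]

log_likelihood = log_alpha[-1]               # must end in the last state
```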

Place, publisher, year, edition, pages
IEEE Signal Processing Society, 2022
Series
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ISSN 2379-190X
Keywords
seq2seq, attention, HMMs, duration modelling, acoustic modelling
HSV category
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-312455 (URN), 10.1109/ICASSP43922.2022.9746686 (DOI), 000864187907152 (), 2-s2.0-85131260082 (Scopus ID)
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 23-27, 2022, Singapore, Singapore
Funder
Knut and Alice Wallenberg Foundation, WASP
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220601

Available from: 2022-05-18 Created: 2022-05-18 Last updated: 2025-02-01 Bibliographically approved
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-1886-681X