Publications (10 of 17)
Mehta, S., Gamper, H. & Jojic, N. (2025). Make Some Noise: Towards LLM audio reasoning and generation using sound tokens. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): . Paper presented at 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025 (pp. 1-5). Institute of Electrical and Electronics Engineers (IEEE)
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
2025 (English). In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2025, pp. 1-5. Conference paper, published paper (peer-reviewed)
Abstract [en]

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low-bitrate discrete tokens at 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained detail through audio tokenization, our multimodal LLM trained with discrete tokens achieves results in audio comprehension that are competitive with state-of-the-art methods, though audio generation remains poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
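As a rough illustration of the recipe described in this abstract, the sketch below shows how discrete audio tokens could be interleaved with text and fine-tuned with LoRA using the Hugging Face transformers and peft libraries. The base model, audio token vocabulary, and token IDs are assumptions for illustration, not the authors' released code or tokenizer.

```python
# Hypothetical sketch: interleaving ultra-low-bitrate audio tokens with text
# for LoRA fine-tuning of a causal LLM. Base model, special tokens and the
# placeholder audio IDs are assumptions, not the paper's actual setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"            # assumed base model
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Pretend the tokenizer has mapped a waveform to ~0.23 kbps of discrete IDs.
audio_ids = [5, 812, 93, 5, 640]             # placeholder token indices
audio_vocab = [f"<aud_{i}>" for i in range(1024)]
tok.add_tokens(audio_vocab + ["<aud_start>", "<aud_end>"])
model.resize_token_embeddings(len(tok))

# Wrap the LLM with low-rank adapters so only a small set of weights train.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

prompt = ("Describe this sound: <aud_start>"
          + "".join(f"<aud_{i}>" for i in audio_ids) + "<aud_end>")
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```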

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
audio language models, multimodal LLMs, audio reasoning, audio captioning, audio tokenization, audio generation
HSV category
Research programme
Computer Science; Computer Science
Identifiers
urn:nbn:se:kth:diva-368341 (URN), 10.1109/ICASSP49660.2025.10888809 (DOI), 2-s2.0-105003881005 (Scopus ID), 979-8-3503-6874-1 (ISBN)
Conference
2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025
Note

Part of ISBN 979-8-3503-6874-1, 979-8-3503-6875-8

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13 Bibliographically approved
Lodagala, V. S., Alkanhal, L., Izham, D., Mehta, S., Chowdhury, S., Makki, A., . . . Ali, A. (2025). SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching. In: Interspeech 2025: . Paper presented at 26th Interspeech Conference 2025, Rotterdam, Kingdom of the Netherlands, August 17-21, 2025 (pp. 4793-4797). International Speech Communication Association
SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching
2025 (English). In: Interspeech 2025, International Speech Communication Association, 2025, pp. 4793-4797. Conference paper, published paper (peer-reviewed)
Abstract [en]

Curating Text-to-Speech (TTS) datasets is a strenuous task given the quality considerations involved. While it is hard to find high-quality TTS datasets in languages other than English, it is rarer still to come across code-switching (CS) datasets. As part of this work, we curate a 4-hour Arabic-English TTS corpus consisting of code-switched Egyptian-English, monolingual Modern Standard Arabic (MSA), Egyptian, and English, all recorded by the same voice talent. We demonstrate the importance of vowelization and the need for better phonemization of Arabic text. To this end, we present a modified espeak-ng phonemizer that handles various irregularities of espeak-ng over Arabic text. Upon training baseline TTS systems on this benchmark, we demonstrate its efficacy through extensive subjective evaluations.
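For context on the phonemization step the abstract discusses, a minimal sketch of Arabic phonemization with the stock espeak-ng backend (via the Python phonemizer package) is shown below. The example text is arbitrary, and this plain call does not include the paper's modifications that handle vowelization and espeak-ng's irregularities over Arabic.

```python
# Minimal sketch, not the paper's modified phonemizer: run stock espeak-ng
# over an Arabic string to obtain a phone sequence.
from phonemizer import phonemize

arabic_text = "السلام عليكم"       # example input, not from the corpus
phones = phonemize(
    arabic_text,
    language="ar",                  # Modern Standard Arabic voice in espeak-ng
    backend="espeak",
    strip=True,
)
print(phones)
```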

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Code-switching, Dialectal Speech, Multilingual, Phonemization, Text-to-Speech Synthesis
HSV category
Identifiers
urn:nbn:se:kth:diva-372803 (URN), 10.21437/Interspeech.2025-2573 (DOI), 2-s2.0-105020056289 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Kingdom of the Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13 Bibliographically approved
Tuttösí, P., Mehta, S., Syvenky, Z., Burkanova, B., Hfsafsti, M., Wang, Y., . . . Lim, A. (2025). Take a Look, it's in a Book, a Reading Robot. In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction: . Paper presented at 20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, Mar 4 2025 - Mar 6 2025 (pp. 1803-1805). Institute of Electrical and Electronics Engineers (IEEE)
Take a Look, it's in a Book, a Reading Robot
2025 (English). In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction, Institute of Electrical and Electronics Engineers (IEEE), 2025, pp. 1803-1805. Conference paper, published paper (peer-reviewed)
Abstract [en]

We demonstrate EmojiVoice, a free, customizable text-to-speech (TTS) toolkit for expressive speech on social robots, and showcase our voices through storytelling. This task is intended for deployment in classrooms or libraries, where the robot can read a story out loud to children. Moreover, we introduce adaptive clarity for noisy environments and for listeners with reduced comprehension ability. This storytelling robot voice allows us to demonstrate how, using our lightweight and customizable TTS, we can produce a voice that is expressive, engaging, clear, and socially appropriate for the task, improving interactions with and perceptions of social robots.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
clear speech synthesis, education robots, Expressive speech synthesis, human robot interaction, noise robust speech synthesis, second language speakers, social robotics, storytelling robots
HSV category
Identifiers
urn:nbn:se:kth:diva-363761 (URN), 10.1109/HRI61500.2025.10973801 (DOI), 2-s2.0-105004876693 (Scopus ID)
Conference
20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, Mar 4 2025 - Mar 6 2025
Note

Part of ISBN 9798350378931

QC 20250526

Available from: 2025-05-21 Created: 2025-05-21 Last updated: 2025-05-26 Bibliographically approved
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, published paper (peer-reviewed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
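As a toy illustration of the representation this abstract describes, the sketch below places a few phonemes at positions in [0.0, 1.0] on a handful of feature axes and interpolates between them. The feature names, phoneme set, and values are illustrative assumptions, not the paper's actual 11-feature table for US English.

```python
# Illustrative sketch (not the paper's feature table): each phoneme gets a
# position in [0.0, 1.0] on each phonological feature axis, and the TTS is
# conditioned on these continuous vectors instead of one-hot phoneme IDs.
import numpy as np

FEATURES = ["voicing", "nasality", "frication", "vowel_height"]  # assumed subset

PHONEME_TABLE = {                 # hypothetical positions
    "s": np.array([0.0, 0.0, 1.0, 0.0]),
    "z": np.array([1.0, 0.0, 1.0, 0.0]),
    "i": np.array([1.0, 0.0, 0.0, 1.0]),
}

def blend(p1: str, p2: str, alpha: float) -> np.ndarray:
    """Interpolate between two phonemes along all feature axes."""
    return (1.0 - alpha) * PHONEME_TABLE[p1] + alpha * PHONEME_TABLE[p2]

# Halfway between /s/ and /z/: only the voicing axis changes.
print(dict(zip(FEATURES, blend("s", "z", 0.5))))
```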

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
HSV category
Identifiers
urn:nbn:se:kth:diva-358877 (URN), 10.21437/Interspeech.2024-1565 (DOI), 001331850102192 (), 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-08 Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: . Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English). In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, published paper (peer-reviewed)
HSV category
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22 Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024: . Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis
2024 (English). In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 1952-1964. Conference paper, published paper (peer-reviewed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, the joint and unified synthesis of speech audio and co-speech 3D gesture motion from text is a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
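The synthetic-data strategy in this abstract can be summarised schematically as below. All class and function names are hypothetical stubs rather than the released code: unimodal text-to-speech and text-to-gesture models produce parallel (but synthetic) pairs, on which the joint model is pre-trained before fine-tuning on the smaller recorded corpus.

```python
# Schematic of the "fake it to make it" pipeline with stub models (assumed
# names, not the authors' code).
class StubUnimodalModel:
    def __init__(self, name):
        self.name = name
    def synthesise(self, text):
        return f"{self.name}({text})"      # placeholder output

def build_synthetic_corpus(texts, tts, gesture):
    """Pair unimodal outputs into synthetic multimodal training examples."""
    return [{"text": t,
             "audio": tts.synthesise(t),
             "motion": gesture.synthesise(t)} for t in texts]

tts = StubUnimodalModel("tts")
gesture = StubUnimodalModel("gesture")
synthetic = build_synthetic_corpus(["Hello there!", "Nice to meet you."],
                                   tts, gesture)
# A joint model would first be pre-trained on `synthetic`, then fine-tuned
# on the (much smaller) recorded multimodal corpus.
print(synthetic[0])
```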

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
HSV category
Identifiers
urn:nbn:se:kth:diva-367174 (URN), 10.1109/CVPRW63382.2024.00201 (DOI), 001327781702011 (), 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15 Created: 2025-07-15 Last updated: 2025-08-13 Bibliographically approved
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings: . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
Matcha-TTS: A fast TTS architecture with conditional flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 11341-11345. Conference paper, published paper (peer-reviewed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
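To make the OT-CFM training objective mentioned here concrete, the PyTorch sketch below shows one flow-matching training step. The tiny MLP and random "mel" targets are placeholders; only the linear interpolation path and the regression target follow the OT-CFM formulation, and this is not the Matcha-TTS implementation itself.

```python
# Minimal sketch of one OT-CFM training step, assuming a placeholder network
# and random data in place of real mel-spectrogram frames.
import torch

sigma_min = 1e-4
net = torch.nn.Sequential(torch.nn.Linear(81, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 80))
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

x1 = torch.randn(16, 80)                      # data sample (e.g. a mel frame)
x0 = torch.randn_like(x1)                     # noise sample
t = torch.rand(16, 1)                         # random time in [0, 1]

xt = (1 - (1 - sigma_min) * t) * x0 + t * x1  # OT path between noise and data
target = x1 - (1 - sigma_min) * x0            # conditional vector field u_t

v = net(torch.cat([xt, t], dim=-1))           # predicted vector field
loss = torch.mean((v - target) ** 2)          # flow-matching regression loss
loss.backward()
opt.step()
```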

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
HSV category
Identifiers
urn:nbn:se:kth:diva-350551 (URN), 10.1109/ICASSP48485.2024.10448291 (DOI), 001396233804117 (), 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-08-13 Bibliographically approved
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024: . Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2285-2289. Conference paper, published paper (peer-reviewed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
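The contrast the abstract draws between deterministic and stochastic duration modelling is illustrated in the sketch below, using untrained placeholder networks: the regression model returns the same durations every call, while the flow-based model samples durations by integrating a learned vector field from noise. The dimensions and architecture are assumptions, not the paper's models.

```python
# Hedged sketch: deterministic duration regression vs. sampling durations by
# Euler-integrating an OT-CFM-style vector field (placeholder networks).
import torch

n_phones, dim = 12, 16
deterministic = torch.nn.Linear(dim, 1)        # regression: one fixed duration
flow = torch.nn.Linear(dim + 2, 1)             # v(x_t, t | text encoding)

text_enc = torch.randn(n_phones, dim)

# Deterministic model: identical durations every call.
fixed_durations = deterministic(text_enc).exp()

# Probabilistic model: integrate the learned ODE from noise, so repeated
# calls give different (but plausible) duration sequences.
x = torch.randn(n_phones, 1)
steps = 10
for i in range(steps):
    t = torch.full((n_phones, 1), i / steps)
    x = x + flow(torch.cat([x, t, text_enc], dim=-1)) / steps
sampled_durations = x.exp()
print(fixed_durations.squeeze(), sampled_durations.squeeze())
```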

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
HSV category
Identifiers
urn:nbn:se:kth:diva-358878 (URN), 10.21437/Interspeech.2024-1582 (DOI), 001331850102086 (), 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-05 Bibliographically approved
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024): . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Unified speech and gesture synthesis using flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 8220-8224. Conference paper, published paper (peer-reviewed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
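The "one single process" idea in this abstract amounts to modelling speech acoustics and gesture motion as one joint target per frame. The short sketch below illustrates this with placeholder dimensions; it is an assumption-laden illustration, not the paper's feature layout.

```python
# Illustrative sketch (placeholder dimensions): stack acoustic and skeletal
# motion features per frame so one decoder models their joint distribution.
import torch

n_frames, n_mel, n_joint_dofs = 100, 80, 45
mel = torch.randn(n_frames, n_mel)               # speech acoustics
motion = torch.randn(n_frames, n_joint_dofs)     # 3D skeleton pose features

joint_target = torch.cat([mel, motion], dim=-1)  # one joint target per frame
print(joint_target.shape)                        # torch.Size([100, 125])

# After sampling from the joint decoder, the two modalities are split back:
mel_out, motion_out = joint_target.split([n_mel, n_joint_dofs], dim=-1)
```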

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
HSV category
Identifiers
urn:nbn:se:kth:diva-361616 (URN), 10.1109/ICASSP48485.2024.10445998 (DOI), 001396233801103 (), 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Apr 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02 Created: 2025-04-02 Last updated: 2025-08-13 Bibliographically approved
Mehta, S., Wang, S., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2023). Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis. In: Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble: . Paper presented at 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023 (pp. 150-156). International Speech Communication Association
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis
2023 (English). In: Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, International Speech Communication Association, 2023, pp. 150-156. Conference paper, published paper (peer-reviewed)
Abstract [en]

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.
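Since the abstract characterises Diff-TTSG as a denoising diffusion-based probabilistic model, a generic DDPM-style training step is sketched below for orientation. The network, noise schedule, and random "joint speech+gesture" frames are placeholders and do not reflect Diff-TTSG's actual architecture.

```python
# Generic denoising-diffusion training step (placeholder network and data),
# shown only to illustrate the "denoising probabilistic" objective.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

net = torch.nn.Sequential(torch.nn.Linear(126, 256), torch.nn.ReLU(),
                          torch.nn.Linear(256, 125))
x0 = torch.randn(8, 125)                      # joint speech+gesture frame
t = torch.randint(0, T, (8,))
noise = torch.randn_like(x0)

a = alphas_cumprod[t].unsqueeze(-1)
xt = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward (noising) process
pred = net(torch.cat([xt, t.float().unsqueeze(-1) / T], dim=-1))
loss = torch.mean((pred - noise) ** 2)        # predict the added noise
loss.backward()
```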

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
Text-to-speech, speech-to-gesture, joint multimodal synthesis, deep generative model, diffusion model, evaluation
HSV category
Research programme
Computer Science; Computer Science
Identifiers
urn:nbn:se:kth:diva-368340 (URN), 10.21437/SSW.2023-24 (DOI)
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023
Research funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 3420 WASP SM GeH
Note

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13 Bibliographically approved
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-1886-681X