Publications (10 of 17)
Mehta, S., Gamper, H. & Jojic, N. (2025). Make Some Noise: Towards LLM audio reasoning and generation using sound tokens. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Paper presented at 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025 (pp. 1-5). Institute of Electrical and Electronics Engineers (IEEE)
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
2025 (English) In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2025, pp. 1-5. Conference paper, Published paper (Refereed)
Abstract [en]

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low-bitrate discrete tokens of 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained detail through audio tokenization, our multimodal LLM trained with discrete tokens achieves results in audio comprehension competitive with state-of-the-art methods, though audio generation remains poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
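The "ultra-low bitrate" claim above is a matter of simple arithmetic: a discrete token stream costs (tokens per second) times (bits per token, i.e. log2 of the codebook size). A minimal sketch, with illustrative codebook size and token rate that are not taken from the paper:

```python
import math

def token_bitrate_bps(vocab_size: int, tokens_per_second: float) -> float:
    """Bitrate of a discrete token stream: tokens/s times bits per token."""
    return tokens_per_second * math.log2(vocab_size)

# Hypothetical values: a 1024-entry codebook costs log2(1024) = 10 bits
# per token, so ~23 tokens/s lands at ~230 bps, i.e. roughly the
# 0.23 kbps regime described in the abstract.
print(token_bitrate_bps(1024, 23))  # 230.0
```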

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
audio language models, multimodal LLMs, audio reasoning, audio captioning, audio tokenization, audio generation
National subject category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368341 (URN); 10.1109/ICASSP49660.2025.10888809 (DOI); 2-s2.0-105003881005 (Scopus ID); 979-8-3503-6874-1 (ISBN)
Conference
2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025
Note

Part of ISBN 979-8-3503-6874-1, 979-8-3503-6875-8

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13. Bibliographically reviewed
Lodagala, V. S., Alkanhal, L., Izham, D., Mehta, S., Chowdhury, S., Makki, A., . . . Ali, A. (2025). SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching. In: Interspeech 2025. Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025 (pp. 4793-4797). International Speech Communication Association
SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching
2025 (English) In: Interspeech 2025, International Speech Communication Association, 2025, pp. 4793-4797. Conference paper, Published paper (Refereed)
Abstract [en]

Curating text-to-speech (TTS) datasets is a strenuous task given the quality requirements involved. High-quality TTS datasets are hard to find in languages other than English, and code-switching (CS) datasets are rarer still. As part of this work, we curate a 4-hour Arabic-English TTS corpus consisting of code-switched Egyptian-English, monolingual Modern Standard Arabic (MSA), Egyptian, and English, all recorded by the same voice talent. We demonstrate the importance of vowelization and the need for better phonemization of Arabic text. To this end, we present a modified espeak-ng phonemizer that handles various irregularities of espeak-ng over Arabic text. By training baseline TTS systems on this benchmark, we demonstrate its efficacy through extensive subjective evaluations.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Code-switching, Dialectal Speech, Multilingual, Phonemization, Text-to-Speech Synthesis
National subject category
Natural Language Processing; Specific Languages; General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-372803 (URN); 10.21437/Interspeech.2025-2573 (DOI); 2-s2.0-105020056289 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13. Bibliographically reviewed
Tuttösí, P., Mehta, S., Syvenky, Z., Burkanova, B., Hfsafsti, M., Wang, Y., . . . Lim, A. (2025). Take a Look, it's in a Book, a Reading Robot. In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction. Paper presented at 20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, March 4-6, 2025 (pp. 1803-1805). Institute of Electrical and Electronics Engineers (IEEE)
Take a Look, it's in a Book, a Reading Robot
2025 (English) In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction, Institute of Electrical and Electronics Engineers (IEEE), 2025, pp. 1803-1805. Conference paper, Published paper (Refereed)
Abstract [en]

We demonstrate EmojiVoice, a free, customizable text-to-speech (TTS) toolkit for expressive speech on social robots, and showcase our voices through storytelling. The task is aimed at deployment in classrooms or libraries, where the robot can read a story aloud to children. Moreover, we introduce adaptive clarity for noisy environments and for listeners with reduced comprehension ability. This storytelling robot voice allows us to demonstrate how, using our lightweight and customizable TTS, we can produce a voice that is expressive, engaging, clear, and socially appropriate for the task, improving interactions with and perceptions of social robots.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
clear speech synthesis, education robots, expressive speech synthesis, human-robot interaction, noise-robust speech synthesis, second-language speakers, social robotics, storytelling robots
National subject category
Robotics and Automation; Other Engineering and Technologies; Human-Computer Interaction (Interaction Design); Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-363761 (URN); 10.1109/HRI61500.2025.10973801 (DOI); 2-s2.0-105004876693 (Scopus ID)
Conference
20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, March 4-6, 2025
Note

Part of ISBN 9798350378931

QC 20250526

Available from: 2025-05-21 Created: 2025-05-21 Last updated: 2025-05-26. Bibliographically reviewed
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2815-2819). International Speech Communication Association
Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2815-2819. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
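The conditioning scheme described above can be sketched in a few lines: each phoneme occupies a position in [0.0, 1.0] on every feature axis, and intermediate conditioning vectors are obtained by moving along one axis. The feature names and positions below are hypothetical stand-ins, not the paper's actual 11-feature inventory:

```python
import numpy as np

# Hypothetical feature axes and phoneme placements for illustration only.
FEATURES = ["voicing", "height", "frontness"]
PHONEME_POS = {
    "s": np.array([0.0, 0.5, 0.7]),  # voiceless fricative
    "z": np.array([1.0, 0.5, 0.7]),  # its voiced counterpart
}

def interpolate(p1: str, p2: str, alpha: float) -> np.ndarray:
    """Continuous conditioning vector between two phonemes, alpha in [0, 1]."""
    return (1.0 - alpha) * PHONEME_POS[p1] + alpha * PHONEME_POS[p2]

# Halfway along the voicing axis between /s/ and /z/, the kind of input
# probed in the categorical perception experiment:
v = interpolate("s", "z", 0.5)
```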

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National subject category
Natural Language Processing; Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN); 10.21437/Interspeech.2024-1565 (DOI); 001331850102192 (); 2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-08. Bibliographically reviewed
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
2024 (English) In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1952-1964. Conference paper, Published paper (Refereed)
National subject category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Project
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22. Bibliographically reviewed
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024. Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, June 16-22, 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis
2024 (English) In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 1952-1964. Conference paper, Published paper (Refereed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
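The recipe in the abstract reduces to a simple pipeline: unimodal models trained on large datasets each synthesise their modality from shared text, and pairing the outputs yields a synthetic parallel corpus for pre-training a joint model. A toy sketch with stand-in functions (not the paper's actual TTS or gesture systems):

```python
# Stand-in unimodal synthesisers; a real pipeline would call trained models.
def tts(text: str) -> str:
    """Placeholder text-to-speech model."""
    return f"audio({text})"

def text_to_gesture(text: str) -> str:
    """Placeholder text-to-gesture model."""
    return f"motion({text})"

def make_synthetic_parallel(texts):
    """Pair unimodal outputs into (text, audio, motion) training triples."""
    return [(t, tts(t), text_to_gesture(t)) for t in texts]

# The resulting triples would pre-train a joint speech-and-gesture model.
corpus = make_synthetic_parallel(["hello", "goodbye"])
```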

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National subject category
Signal Processing; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367174 (URN); 10.1109/CVPRW63382.2024.00201 (DOI); 001327781702011 (); 2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, June 16-22, 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15 Created: 2025-07-15 Last updated: 2025-08-13. Bibliographically reviewed
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Matcha-TTS: A fast TTS architecture with conditional flow matching. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings. Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, April 14-19, 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
Matcha-TTS: A fast TTS architecture with conditional flow matching
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
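The OT-CFM objective mentioned above pairs each (noise, data) sample with a straight-line conditional path and a constant target vector field. A minimal numpy sketch of that training pair, in the standard flow-matching formulation; the sigma value and tensor shapes here are illustrative, not the paper's settings:

```python
import numpy as np

SIGMA_MIN = 1e-4  # illustrative value

def ot_cfm_pair(x0: np.ndarray, x1: np.ndarray, t: float):
    """Point on the conditional OT path and its target vector field.

    x0: noise sample, x1: data sample (e.g. a mel-spectrogram frame).
    The network is regressed onto u_t given (x_t, t).
    """
    x_t = (1.0 - (1.0 - SIGMA_MIN) * t) * x0 + t * x1
    u_t = x1 - (1.0 - SIGMA_MIN) * x0  # constant in t: a straight path
    return x_t, u_t

rng = np.random.default_rng(0)
x0, x1 = rng.standard_normal(4), rng.standard_normal(4)
xa, u = ot_cfm_pair(x0, x1, 0.2)
xb, _ = ot_cfm_pair(x0, x1, 0.7)
# The path is linear, so a finite difference recovers the target field:
assert np.allclose((xb - xa) / 0.5, u)
```

The straightness of the path is what allows an ODE solver to reach high quality in few synthesis steps.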

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, diffusion models, flow matching, speech synthesis, text-to-speech
National subject category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN); 10.1109/ICASSP48485.2024.10448291 (DOI); 001396233804117 (); 2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, April 14-19, 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-08-13. Bibliographically reviewed
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024. Paper presented at 25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024 (pp. 2285-2289). International Speech Communication Association
Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English) In: Interspeech 2024, International Speech Communication Association, 2024, pp. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
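The contrast at the heart of the abstract, regression durations versus sampled durations, can be shown in miniature. The per-phone log-duration parameters below are hypothetical stand-ins (a real system predicts them, or a full flow, from text), but they illustrate why sampling gives the timing variability that regression cannot:

```python
import numpy as np

# Hypothetical per-phone (mean, std) of log-duration in log-frames.
LOG_DUR = {"h": (1.6, 0.3), "e": (2.0, 0.4), "j": (1.8, 0.35)}

def durations(phones, rng=None):
    """Deterministic (regression-style) durations when rng is None,
    otherwise one stochastic sample per phone."""
    if rng is None:
        return np.array([np.exp(LOG_DUR[p][0]) for p in phones])
    return np.array([rng.lognormal(*LOG_DUR[p]) for p in phones])

phones = ["h", "e", "j"]
det = durations(phones)                            # identical every call
s1 = durations(phones, np.random.default_rng(1))   # varies per synthesis
s2 = durations(phones, np.random.default_rng(2))
assert np.allclose(det, durations(phones))
assert not np.allclose(s1, s2)
```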

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, speech synthesis, spontaneous speech
National subject category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358878 (URN); 10.21437/Interspeech.2024-1582 (DOI); 001331850102086 (); 2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conference 2024, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-05. Bibliographically reviewed
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024). Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 14-19, 2024, Seoul, South Korea (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Unified speech and gesture synthesis using flow matching
2024 (English) In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, pp. 8220-8224. Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National subject category
General Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN); 10.1109/ICASSP48485.2024.10445998 (DOI); 001396233801103 (); 2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02 Created: 2025-04-02 Last updated: 2025-08-13. Bibliographically reviewed
Mehta, S., Wang, S., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2023). Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis. In: Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble. Paper presented at 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26-28, 2023 (pp. 150-156). International Speech Communication Association
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis
2023 (English) In: Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, International Speech Communication Association, 2023, pp. 150-156. Conference paper, Published paper (Refereed)
Abstract [en]

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.
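Diff-TTSG being diffusion-based means the model learns to invert a forward noising process over joint speech-and-gesture frames. A generic DDPM-style forward step, sketched here in the standard formulation rather than the paper's specific noise schedule:

```python
import numpy as np

def forward_diffuse(x0: np.ndarray, alpha_bar: float, rng):
    """Standard DDPM forward process: q(x_t | x_0) = N(sqrt(ab) x0, (1 - ab) I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)  # e.g. a joint acoustics-and-pose frame
x_t, eps = forward_diffuse(x0, alpha_bar=0.5, rng=rng)
# Given the noise, x0 is exactly recoverable; the denoiser is trained to
# approximate this inversion without knowing eps:
x0_rec = (x_t - np.sqrt(0.5) * eps) / np.sqrt(0.5)
assert np.allclose(x0_rec, x0)
```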

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
text-to-speech, speech-to-gesture, joint multimodal synthesis, deep generative model, diffusion model, evaluation
National subject category
Signal Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368340 (URN); 10.21437/SSW.2023-24 (DOI)
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26-28, 2023
Research funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 3420 WASP SM GeH
Note

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13. Bibliographically reviewed
Organisations
Identifiers
ORCID iD: orcid.org/0000-0002-1886-681X
