kth.sePublications KTH
Change search
Link to record
Permanent link

Direct link
Publications (10 of 17) Show all publications
Mehta, S., Gamper, H. & Jojic, N. (2025). Make Some Noise: Towards LLM audio reasoning and generation using sound tokens. In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP): . Paper presented at 2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025 (pp. 1-5). Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
2025 (English)In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE) , 2025, p. 1-5Conference paper, Published paper (Refereed)
Abstract [en]

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low bitrate discrete tokens of 0.23kpbs, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained details through audio tokenization, our multimodal LLM trained with discrete tokens achieves competitive results in audio comprehension with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
audio language models, multimodal LLMs, audio reasoning, audio captioning, audio tokenization, audio generation
National Category
Natural Language Processing
Research subject
Computer Science; Computer Science
Identifiers
urn:nbn:se:kth:diva-368341 (URN)10.1109/ICASSP49660.2025.10888809 (DOI)2-s2.0-105003881005 (Scopus ID)979-8-3503-6874-1 (ISBN)979-8-3503-6874-1 (ISBN)
Conference
2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025
Note

Part of ISBN 979-8-3503-6874-1, 979-8-3503-6875-8

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13Bibliographically approved
Lodagala, V. S., Alkanhal, L., Izham, D., Mehta, S., Chowdhury, S., Makki, A., . . . Ali, A. (2025). SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching. In: Interspeech 2025: . Paper presented at 26th Interspeech Conference 2025, Rotterdam, Netherlands, Kingdom of the, August 17-21, 2025 (pp. 4793-4797). International Speech Communication Association
Open this publication in new window or tab >>SawtArabi: A Benchmark Corpus for Arabic TTS. Standard, Dialectal and Code-Switching
Show others...
2025 (English)In: Interspeech 2025, International Speech Communication Association , 2025, p. 4793-4797Conference paper, Published paper (Refereed)
Abstract [en]

Curating Text-to-Speech (TTS) datasets is a strenuous task given the quality considerations. While it is hard to find high-quality TTS datasets in languages other than English, it is rare to come across code-switching (CS) datasets. As a part of this work, we curate a 4-hour Arabic-English TTS corpus consisting of code-switched Egyptian-English, monolingual Modern Standard Arabic (MSA), Egyptian, and English, all recorded by the same voice talent. We demonstrate the importance of vowelization and the need for better phonemization of Arabic text. To this effect, we present the modified espeak-ng phonemizer that handles various irregularities of espeak-ng over Arabic text. Upon training baseline TTS systems over this benchmark, we demonstrate its efficacy through extensive subjective evaluations.

Place, publisher, year, edition, pages
International Speech Communication Association, 2025
Keywords
Code-switching, Dialectal Speech, Multilingual, Phonemization, Text-to-Speech Synthesis
National Category
Natural Language Processing Studies of Specific Languages Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-372803 (URN)10.21437/Interspeech.2025-2573 (DOI)2-s2.0-105020056289 (Scopus ID)
Conference
26th Interspeech Conference 2025, Rotterdam, Netherlands, Kingdom of the, August 17-21, 2025
Note

QC 20251113

Available from: 2025-11-13 Created: 2025-11-13 Last updated: 2025-11-13Bibliographically approved
Tuttösí, P., Mehta, S., Syvenky, Z., Burkanova, B., Hfsafsti, M., Wang, Y., . . . Lim, A. (2025). Take a Look, it's in a Book, a Reading Robot. In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction: . Paper presented at 20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, Mar 4 2025 - Mar 6 2025 (pp. 1803-1805). Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>Take a Look, it's in a Book, a Reading Robot
Show others...
2025 (English)In: HRI 2025 - Proceedings of the 2025 ACM/IEEE International Conference on Human-Robot Interaction, Institute of Electrical and Electronics Engineers (IEEE) , 2025, p. 1803-1805Conference paper, Published paper (Refereed)
Abstract [en]

We demonstrate EmojiVoice, a free, customizable text-to-speech (TTS) toolkit for expressive speech on social robots. We demonstrate our voices through storytelling. This task is aimed to be deployed in classrooms, or libraries where the robot can read a story out loud to children. Moreover, we introduce adaptive clarity to to noisy environments and those with reduced comprehension ability. This storytelling robot voice allows us to demonstrate how, using our light weight and customizable TTS, we are able to have a voice that is expressive, engaging, clear and socially appropriate for the task, improving interactions with and perceptions of social robots.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
clear speech synthesis, education robots, Expressive speech synthesis, human robot interaction, noise robust speech synthesis, second language speakers, social robotics, storytelling robots
National Category
Robotics and automation Other Engineering and Technologies Human Computer Interaction Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-363761 (URN)10.1109/HRI61500.2025.10973801 (DOI)2-s2.0-105004876693 (Scopus ID)
Conference
20th Annual ACM/IEEE International Conference on Human-Robot Interaction, HRI 2025, Melbourne, Australia, Mar 4 2025 - Mar 6 2025
Note

 Part of ISBN 9798350378931 QC 20250526

Available from: 2025-05-21 Created: 2025-05-21 Last updated: 2025-05-26Bibliographically approved
Tånnander, C., Mehta, S., Beskow, J. & Edlund, J. (2024). Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis. In: Interspeech 2024: . Paper presented at 25th Interspeech Conferece 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2815-2819). International Speech Communication Association
Open this publication in new window or tab >>Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis
2024 (English)In: Interspeech 2024, International Speech Communication Association , 2024, p. 2815-2819Conference paper, Published paper (Refereed)
Abstract [en]

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes. In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS. Effectiveness was assessed by investigating two selected features in two ways: through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception, and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
analysis-by-synthesis, controllable text-to-speech synthesis, phonological features
National Category
Natural Language Processing Computer Sciences
Identifiers
urn:nbn:se:kth:diva-358877 (URN)10.21437/Interspeech.2024-1565 (DOI)001331850102192 ()2-s2.0-85214785956 (Scopus ID)
Conference
25th Interspeech Conferece 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250128

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-08Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition: . Paper presented at IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 1952-1964).
Open this publication in new window or tab >>Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis
Show others...
2024 (English)In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, p. 1952-1964Conference paper, Published paper (Refereed)
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-355103 (URN)
Conference
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Projects
bodytalk
Note

QC 20241022

Available from: 2024-10-22 Created: 2024-10-22 Last updated: 2024-10-22Bibliographically approved
Mehta, S., Deichler, A., O'Regan, J., Moëll, B., Beskow, J., Henter, G. E. & Alexanderson, S. (2024). Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis. In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024: . Paper presented at 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024 (pp. 1952-1964). Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis
Show others...
2024 (English)In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE) , 2024, p. 1952-1964Conference paper, Published paper (Refereed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National Category
Signal Processing Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367174 (URN)10.1109/CVPRW63382.2024.00201 (DOI)001327781702011 ()2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, Jun 16 2024 - Jun 22 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15 Created: 2025-07-15 Last updated: 2025-08-13Bibliographically approved
Mehta, S., Tu, R., Beskow, J., Székely, É. & Henter, G. E. (2024). MATCHA-TTS: A FAST TTS ARCHITECTURE WITH CONDITIONAL FLOW MATCHING. In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings: . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024 (pp. 11341-11345). Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>MATCHA-TTS: A FAST TTS ARCHITECTURE WITH CONDITIONAL FLOW MATCHING
Show others...
2024 (English)In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE) , 2024, p. 11341-11345Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, Diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN)10.1109/ICASSP48485.2024.10448291 (DOI)001396233804117 ()2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, Korea, Apr 14 2024 - Apr 19 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-08-13Bibliographically approved
Mehta, S., Lameris, H., Punmiya, R., Beskow, J., Székely, É. & Henter, G. E. (2024). Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech. In: Interspeech 2024: . Paper presented at 25th Interspeech Conferece 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024 (pp. 2285-2289). International Speech Communication Association
Open this publication in new window or tab >>Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
Show others...
2024 (English)In: Interspeech 2024, International Speech Communication Association , 2024, p. 2285-2289Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358878 (URN)10.21437/Interspeech.2024-1582 (DOI)001331850102086 ()2-s2.0-85214793947 (Scopus ID)
Conference
25th Interspeech Conferece 2024, Kos Island, Greece, Sep 1 2024 - Sep 5 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-05Bibliographically approved
Mehta, S., Tu, R., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2024). Unified speech and gesture synthesis using flow matching. In: 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024): . Paper presented at 49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), APR 14-19, 2024, Seoul, SOUTH KOREA (pp. 8220-8224). Institute of Electrical and Electronics Engineers (IEEE)
Open this publication in new window or tab >>Unified speech and gesture synthesis using flow matching
Show others...
2024 (English)In: 2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE) , 2024, p. 8220-8224Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in one single process. The new training regime, meanwhile, enables better synthesis quality in much fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN)10.1109/ICASSP48485.2024.10445998 (DOI)001396233801103 ()2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), APR 14-19, 2024, Seoul, SOUTH KOREA
Note

Part of ISBN 979-8-3503-4486-8,  979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02 Created: 2025-04-02 Last updated: 2025-08-13Bibliographically approved
Mehta, S., Wang, S., Alexanderson, S., Beskow, J., Székely, É. & Henter, G. E. (2023). Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis. In: Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble: . Paper presented at 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023 (pp. 150-156). International Speech Communication Association
Open this publication in new window or tab >>Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis
Show others...
2023 (English)In: Proceedings 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, International Speech Communication Association , 2023, p. 150-156Conference paper, Published paper (Refereed)
Abstract [en]

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
Text-to-speech, speech-to-gesture, joint multimodal synthesis, deep generative model, diffusion model, evaluation
National Category
Signal Processing
Research subject
Computer Science; Computer Science
Identifiers
urn:nbn:se:kth:diva-368340 (URN)10.21437/SSW.2023-24 (DOI)
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 3420 WASP SM GeH
Note

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13Bibliographically approved
Organisations
Identifiers
ORCID iD: ORCID iD iconorcid.org/0000-0002-1886-681X

Search in DiVA

Show all publications