Probabilistic Speech & Motion Synthesis: Towards More Expressive and Multimodal Generative Models
Mehta, Shivam (KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH). ORCID iD: 0000-0002-1886-681X
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Sustainable development
SDG 9: Industry, innovation and infrastructure
Abstract [en]

Human communication is richly multimodal, combining speech with co-speech gestures to convey meaning, intention, and affect. Both modalities are shaped by context and communicative intent, and exhibit substantial variability in timing, prosody, and motion. Accurately generating these behaviors from text presents a fundamental challenge in artificial intelligence. Traditional deterministic systems fall short in capturing this diversity, leading to oversmoothed, repetitive outputs that lack spontaneity. This thesis addresses these limitations by developing a sequence of probabilistic generative models for high-quality, efficient, and expressive synthesis of speech and co-speech gestures from textual input.

We begin by advancing probabilistic text-to-speech (TTS) through the integration of monotonic alignment and duration modeling via neural hidden Markov models (HMMs). These models replace attention mechanisms with a left-to-right HMM whose emissions are parameterized by neural networks, and offer robust, data-efficient training with exact likelihood optimization and controllable prosody. Building on this foundation, we introduce OverFlow, a framework that combines neural HMMs with normalizing flows to model the complex, non-Gaussian distribution of speech acoustics. This enables fully probabilistic modeling and sampling of audio features with improved likelihood and naturalness. To achieve faster yet expressive synthesis, we present Matcha-TTS, a non-autoregressive (NAR) TTS system trained with optimal-transport conditional flow matching (OT-CFM). This model leverages efficient ODE-based sampling and a lightweight convolutional transformer architecture, significantly reducing the number of synthesis steps needed while maintaining high perceptual quality. We further investigate probabilistic duration modeling in the context of fast non-autoregressive TTS models and demonstrate that it substantially benefits spontaneous speech synthesis, where duration variability is high and deterministic models underperform.

Expanding from unimodal to multimodal generation, we explore the joint synthesis of speech and co-speech gestures. Diff-TTSG introduces a diffusion-based framework for integrated generation using double diffusion decoders, while Match-TTSG improves synthesis speed and coherence by extending OT-CFM to the multimodal domain with the help of a unified decoder. Match-TTSG learns the joint distribution over acoustic and gestural features, enabling synchronized and cross-modally appropriate output from a single generative process. To address data scarcity in multimodal corpora, we propose Fake it to make it, a two-stage strategy in which synthetic data generated by powerful unimodal models are used to pretrain a multimodal generative system, yielding improved downstream performance.

Finally, the thesis transitions to discrete audio modeling and large language models (LLMs). We propose LM-MSN, which combines variational quantization with flow-matching reconstruction to produce low-bitrate discrete audio tokens. This facilitates early fusion of audio and text tokens and enables multimodal LLM training for both audio comprehension and generation. Together, the contributions of this thesis represent a coherent progression from probabilistic speech synthesis to unified multimodal generation and scalable discrete modeling. By leveraging expressive generative modeling across modalities, we demonstrate how probabilistic modeling can overcome the limitations of deterministic synthesis and move towards more natural, controllable, and expressive communicative AI.
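The contrast that motivates the whole thesis can be written down in two lines. As a sketch in our own notation (not taken from the thesis), a deterministic model trained with squared error converges to the conditional mean of the acoustics x given the text t, averaging incompatible renditions into one oversmoothed output, whereas a probabilistic model fits the full conditional distribution and draws a fresh sample at every synthesis:

    % MSE-optimal deterministic prediction is the conditional mean (oversmoothing):
    f^\star = \arg\min_f \, \mathbb{E}_{(\mathbf{x},t)} \lVert \mathbf{x} - f(t) \rVert^2
      \;\Rightarrow\; f^\star(t) = \mathbb{E}[\mathbf{x} \mid t]
    % Probabilistic synthesis: maximize likelihood, then sample a new rendition:
    \theta^\star = \arg\max_\theta \, \mathbb{E}_{(\mathbf{x},t)} \log p_\theta(\mathbf{x} \mid t),
      \qquad \hat{\mathbf{x}} \sim p_{\theta^\star}(\cdot \mid t)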

Abstract [sv]

Human communication is multimodal, combining speech with co-speech gestures to convey meaning, intent, and emotion. Both modalities are shaped by context and by our communicative intentions, and exhibit great variation in timing, prosody, and movement. Accurately synthesizing these behaviors from text is a central problem in artificial intelligence. Traditional deterministic systems fail to capture this diversity, which leads to repetitive, unnaturally smoothed-out output lacking spontaneity. This thesis addresses these shortcomings by developing a set of probabilistic generative models for high-quality, computationally efficient, and expressive synthesis of speech and gestures from text input.

We first advance probabilistic text-to-speech (TTS) synthesis by integrating neural hidden Markov models (neural HMMs), which provide duration modeling and monotonic alignment between input and output. This setup replaces the neural attention mechanisms of conventional neural TTS with a left-to-right HMM whose distribution functions are defined by neural networks, and offers robust, data-efficient training with exact likelihood maximization and controllable prosody. With this model as a foundation, we then introduce OverFlow, a framework that combines neural HMMs with normalizing flows to describe the complex, non-Gaussian distribution of the acoustic features of speech. This enables probabilistic modeling and sampling of speech acoustics with improved likelihood and naturalness. To obtain faster yet still expressive synthesis, we present Matcha-TTS, a non-autoregressive (NAR) TTS method trained with conditional flow matching based on optimal transport (so-called OT-CFM). This model combines numerically easy-to-solve ordinary differential equations (ODEs) with a computationally efficient transformer architecture, substantially reducing the number of time steps needed at synthesis while maintaining high perceptual quality. We further investigate probabilistic duration modeling in the context of efficient non-autoregressive text-to-speech models, and show that probabilistic models significantly benefit spontaneous speech synthesis, where durations are highly variable and deterministic models underperform.

We expand from unimodal to multimodal output by exploring the simultaneous synthesis of speech and co-speech gestures. Diff-TTSG introduces a diffusion-based framework for generating these two modalities in parallel in one integrated system using dual diffusion processes, while Match-TTSG improves synthesis speed and coherence by applying OT-CFM to multimodal data with a single shared ODE for probabilistic synthesis. Match-TTSG learns the joint distribution over acoustic and gestural features, enabling synchronized and cross-modally coherent output from a single generative process. To handle the shortage of datasets covering all modalities at once, we launch Fake it to make it, a two-stage strategy in which synthetic data generated by powerful single-modality models are used to pretrain a multimodal synthesis system, yielding improved final results.

Finally, the thesis treats discrete models of audio data and large language models (LLMs). We propose LM-MSN, which combines quantization with flow-matching reconstruction to enable a low-bitrate discrete audio representation. This enables multimodal LLM training on sequences containing both text and discrete audio representations, for comprehension as well as synthesis of audio. Together, the contributions of this thesis describe a coherent progression from probabilistic speech synthesis to unified multimodal models and scalable discrete modeling. By using expressive generative modeling across several modalities, we demonstrate how probabilistic methods can overcome the limitations of deterministic synthesis and lead to more natural, controllable, and expressive communicative AI.

Place, publisher, year, edition, pages
KTH Royal Institute of Technology, 2025. p. 71
Series
TRITA-EECS-AVL; 2025:76
Keywords [en]
text-to-speech, speech synthesis, co-speech gesture synthesis, multimodal synthesis, probabilistic generative models, neural hidden Markov models, HMMs, normalizing flows, duration modeling, diffusion models, score matching, conditional flow matching, OT-CFM, probabilistic duration modeling, spontaneous speech, large language models, LLMs, variational quantization, VQ-VAE, audio comprehension, audio generation
Keywords [sv]
text-to-speech, speech synthesis, gesture synthesis, multimodal synthesis, probabilistic generative models, neural hidden Markov models, HMMs, normalizing flows, duration modeling, diffusion models, score matching, conditional flow matching, OT-CFM, probabilistic duration modeling, spontaneous speech, large language models, LLMs, variational quantization, VQ-VAE, audio comprehension, audio synthesis
National Category
Computer and Information Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-368342
ISBN: 978-91-8106-360-8 (print)
OAI: oai:DiVA.org:kth-368342
DiVA, id: diva2:1988881
Public defence
2025-09-12, https://kth-se.zoom.us/j/69476396694, Kollegiesalen, Brinellvägen 8, KTH Campus, Stockholm, 13:00 (English)
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 3420 WASP SM GeH
Note

QC 20250814

Available from: 2025-08-14 Created: 2025-08-13 Last updated: 2025-08-27 Bibliographically approved
List of papers
1. Neural HMMs are all you need (for high-quality attention-free TTS)
2022 (English). In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE Signal Processing Society, 2022, p. 7457-7461. Conference paper, Published paper (Refereed)
Abstract [en]

Neural sequence-to-sequence TTS has achieved significantly better output quality than statistical speech synthesis using HMMs. However, neural TTS is generally not probabilistic and uses non-monotonic attention. Attention failures increase training time and can make synthesis babble incoherently. This paper describes how the old and new paradigms can be combined to obtain the advantages of both worlds, by replacing attention in neural TTS with an autoregressive left-right no-skip hidden Markov model defined by a neural network. Based on this proposal, we modify Tacotron 2 to obtain an HMM-based neural TTS model with monotonic alignment, trained to maximise the full sequence likelihood without approximation. We also describe how to combine ideas from classical and contemporary TTS for best results. The resulting example system is smaller and simpler than Tacotron 2, and learns to speak with fewer iterations and less data, whilst achieving comparable naturalness prior to the post-net. Our approach also allows easy control over speaking rate.
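To make the training objective concrete: in our notation (the paper's exact parameterization may differ), the model maximizes the exact marginal likelihood of the acoustic sequence over all monotonic hidden-state alignments, computed with the classic forward algorithm rather than attention:

    % Exact sequence likelihood, marginalizing over HMM state sequences z_{1:T}:
    p_\theta(\mathbf{x}_{1:T}) = \sum_{z_{1:T}} \prod_{t=1}^{T}
      p_\theta(z_t \mid z_{t-1}) \, p_\theta(\mathbf{x}_t \mid z_t, \mathbf{x}_{1:t-1})
    % Left-right no-skip structure restricts z_t \in \{z_{t-1},\, z_{t-1}+1\}, so the
    % forward recursion \alpha_t(j) = [\alpha_{t-1}(j)\,a_{j,j} + \alpha_{t-1}(j-1)\,a_{j-1,j}]\, b_j(\mathbf{x}_t)
    % is cheap, and alignment is monotonic by construction.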

Place, publisher, year, edition, pages
IEEE Signal Processing Society, 2022
Series
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ISSN 2379-190X
Keywords
seq2seq, attention, HMMs, duration modelling, acoustic modelling
National Category
Natural Language Processing; Probability Theory and Statistics; Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-312455 (URN)
10.1109/ICASSP43922.2022.9746686 (DOI)
000864187907152 (ISI)
2-s2.0-85131260082 (Scopus ID)
Conference
47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 23-27, 2022, Singapore
Funder
Knut and Alice Wallenberg Foundation, WASP
Note

Part of proceedings: ISBN 978-1-6654-0540-9

QC 20220601

Available from: 2022-05-18 Created: 2022-05-18 Last updated: 2025-08-13 Bibliographically approved
2. OverFlow: Putting flows on top of neural transducers for better TTS
2023 (English). In: Interspeech 2023, International Speech Communication Association, 2023, p. 4279-4283. Conference paper, Published paper (Refereed)
Abstract [en]

Neural HMMs are a type of neural transducer recently proposed for sequence-to-sequence modelling in text-to-speech. They combine the best features of classic statistical speech synthesis and modern neural TTS, requiring less data and fewer training updates, and are less prone to gibberish output caused by neural attention failures. In this paper, we combine neural HMM TTS with normalising flows for describing the highly non-Gaussian distribution of speech acoustics. The result is a powerful, fully probabilistic model of durations and acoustics that can be trained using exact maximum likelihood. Experiments show that a system based on our proposal needs fewer updates than comparable methods to produce accurate pronunciations and a subjective speech quality close to natural speech.
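The mechanism can be stated in one equation. In our notation (the paper uses a Glow-style invertible post-net), an invertible map f turns the neural HMM's simple per-state emission density into a non-Gaussian one via the change-of-variables formula, keeping the likelihood exact:

    % Invertible post-net f: acoustics x_t are mapped to a latent where the HMM's
    % (Gaussian) emission density applies; the Jacobian term corrects the volume.
    \log p(\mathbf{x}_t \mid z_t) = \log p_{\mathcal{N}}\!\big(f^{-1}(\mathbf{x}_t) \mid z_t\big)
      + \log \left| \det \frac{\partial f^{-1}(\mathbf{x}_t)}{\partial \mathbf{x}_t} \right|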

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
acoustic modelling, Glow, hidden Markov models, invertible post-net, probabilistic TTS
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-338584 (URN)
10.21437/Interspeech.2023-1996 (DOI)
001186650304087 (ISI)
2-s2.0-85167953412 (Scopus ID)
Conference
Interspeech 2023, 24th Annual Conference of the International Speech Communication Association, August 20-24, 2023, Dublin, Ireland
Note

QC 20241014

Available from: 2023-11-07 Created: 2023-11-07 Last updated: 2025-08-13 Bibliographically approved
3. Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis
2023 (English). In: Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, International Speech Communication Association, 2023, p. 150-156. Conference paper, Published paper (Refereed)
Abstract [en]

With read-aloud speech synthesis achieving high naturalness scores, there is a growing research interest in synthesising spontaneous speech. However, human spontaneous face-to-face conversation has both spoken and non-verbal aspects (here, co-speech gestures). Only recently has research begun to explore the benefits of jointly synthesising these two modalities in a single system. The previous state of the art used non-probabilistic methods, which fail to capture the variability of human speech and motion, and risk producing oversmoothing artefacts and sub-optimal synthesis quality. We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together. Our method can be trained on small datasets from scratch. Furthermore, we describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems, and use them to validate our proposed approach.
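For orientation, the generic denoising score-matching objective underlying such diffusion models is sketched below in our notation; the paper's exact noise schedule, weighting, and conditioning may differ. A network s_theta is trained to predict the score of noised speech-and-gesture data at every noise level, and synthesis runs the noising process in reverse:

    % q(x_s | x_0): forward noising kernel at level s; c: conditioning (e.g. text);
    % \lambda(s): weighting. The network regresses the score of the noised data.
    \mathcal{L}(\theta) = \mathbb{E}_{s,\,\mathbf{x}_0,\,\mathbf{x}_s \sim q(\mathbf{x}_s \mid \mathbf{x}_0)}
      \Big[ \lambda(s)\, \big\lVert s_\theta(\mathbf{x}_s, s, c)
      - \nabla_{\mathbf{x}_s} \log q(\mathbf{x}_s \mid \mathbf{x}_0) \big\rVert^2 \Big]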

Place, publisher, year, edition, pages
International Speech Communication Association, 2023
Keywords
Text-to-speech, speech-to-gesture, joint multimodal synthesis, deep generative model, diffusion model, evaluation
National Category
Signal Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368340 (URN)
10.21437/SSW.2023-24 (DOI)
Conference
12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France, August 26–28, 2023
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 3420 WASP SM GeH
Note

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13 Bibliographically approved
4. Matcha-TTS: A fast TTS architecture with conditional flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024 - Proceedings, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 11341-11345. Conference paper, Published paper (Refereed)
Abstract [en]

We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest model on long utterances, and attains the highest mean opinion score in a listening test.
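The OT-CFM objective is compact enough to state in full, following the published flow-matching formulation (sigma_min is a small constant; c denotes the conditioning): noise x_0 is blended linearly toward a data point x_1, and the network regresses the constant vector field of that straight-line path, which is why very few ODE solver steps suffice at synthesis time:

    % Straight-line probability path from noise to data, and its constant target field:
    \mathbf{x}_s = \big(1 - (1-\sigma_{\min})\,s\big)\,\mathbf{x}_0 + s\,\mathbf{x}_1, \qquad s \in [0,1]
    \mathcal{L}_{\text{OT-CFM}} = \mathbb{E}_{s,\,\mathbf{x}_0 \sim \mathcal{N}(0,I),\,\mathbf{x}_1}
      \big\lVert v_\theta(\mathbf{x}_s, s \mid c) - \big(\mathbf{x}_1 - (1-\sigma_{\min})\,\mathbf{x}_0\big) \big\rVert^2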

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
acoustic modelling, diffusion models, flow matching, speech synthesis, text-to-speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-350551 (URN)
10.1109/ICASSP48485.2024.10448291 (DOI)
001396233804117 (ISI)
2-s2.0-85195024093 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2024, Seoul, South Korea, April 14-19, 2024
Note

Part of ISBN 9798350344851

QC 20240716

Available from: 2024-07-16 Created: 2024-07-16 Last updated: 2025-08-13 Bibliographically approved
5. Unified speech and gesture synthesis using flow matching
2024 (English). In: 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 8220-8224. Conference paper, Published paper (Refereed)
Abstract [en]

As text-to-speech technologies achieve remarkable naturalness in read-aloud tasks, there is growing interest in multimodal synthesis of verbal and non-verbal communicative behaviour, such as spontaneous speech and associated body gestures. This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text, trained using optimal-transport conditional flow matching (OT-CFM). The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures, generating both modalities together in a single process. The new training regime, meanwhile, enables better synthesis quality in far fewer steps (network evaluations) than before. Uni- and multimodal subjective tests demonstrate improved speech naturalness, gesture human-likeness, and cross-modal appropriateness compared to existing benchmarks.
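One way to picture the unified decoder, in our notation: the acoustic features a and gesture features g are stacked into one sequence y, and a single flow-matching ODE transports Gaussian noise to both modalities jointly, so cross-modal correlations are captured by construction rather than by synchronizing two separate generators:

    % One ODE over the concatenated speech-and-gesture features y = [a; g]:
    \frac{d\mathbf{y}_s}{ds} = v_\theta(\mathbf{y}_s, s \mid \text{text}), \qquad
    \mathbf{y}_0 \sim \mathcal{N}(0, I), \qquad
    \mathbf{y}_1 \sim p(\mathbf{a}, \mathbf{g} \mid \text{text})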

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Series
International Conference on Acoustics Speech and Signal Processing ICASSP, ISSN 1520-6149
Keywords
Text-to-speech, co-speech gestures, speech-to-gesture, integrated speech and gesture synthesis, ODE models
National Category
Comparative Language Studies and Linguistics
Identifiers
urn:nbn:se:kth:diva-361616 (URN)
10.1109/ICASSP48485.2024.10445998 (DOI)
001396233801103 (ISI)
2-s2.0-105001488767 (Scopus ID)
Conference
49th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), April 14-19, 2024, Seoul, South Korea
Note

Part of ISBN 979-8-3503-4486-8, 979-8-3503-4485-1

QC 20250402

Available from: 2025-04-02 Created: 2025-04-02 Last updated: 2025-08-13 Bibliographically approved
6. Fake it to make it: Using synthetic data to remedy the data shortage in joint multi-modal speech-and-gesture synthesis
2024 (English). In: Proceedings - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Institute of Electrical and Electronics Engineers (IEEE), 2024, p. 1952-1964. Conference paper, Published paper (Refereed)
Abstract [en]

Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage, by simply synthesising additional training material. Specifically, we use uni-modal synthesis models trained on large datasets to create multi-modal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multi-modal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
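A minimal Python sketch of the two-stage recipe follows; every name in it is a hypothetical stand-in chosen to illustrate the data flow, not the paper's actual code or models:

    from typing import Callable, List, Tuple

    Audio = List[float]   # stand-in feature types, for illustration only
    Motion = List[float]

    def build_synthetic_corpus(
        texts: List[str],
        tts: Callable[[str], Audio],                   # strong uni-modal TTS model
        speech_to_gesture: Callable[[Audio], Motion],  # strong speech-to-gesture model
    ) -> List[Tuple[str, Audio, Motion]]:
        """Stage 1: fabricate parallel (text, audio, motion) training triples."""
        corpus = []
        for text in texts:
            audio = tts(text)                  # synthesise speech for the text
            motion = speech_to_gesture(audio)  # synthesise gestures for that speech
            corpus.append((text, audio, motion))
        return corpus

    # Stage 2 (schematic): pre-train the joint model on the large synthetic corpus,
    # then fine-tune on the small genuine multimodal dataset:
    #     joint_model.fit(synthetic_corpus)
    #     joint_model.fit(real_corpus)

    # Toy usage with dummy stand-ins in place of real pretrained models:
    demo = build_synthetic_corpus(["hello"], lambda t: [0.0] * 80, lambda a: [0.0] * 45)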

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
gesture synthesis, motion synthesis, multimodal synthesis, synthetic data, text-to-speech-and-motion, training-on-generated-data
National Category
Signal Processing; Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-367174 (URN)
10.1109/CVPRW63382.2024.00201 (DOI)
001327781702011 (ISI)
2-s2.0-85202828403 (Scopus ID)
Conference
2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2024, Seattle, United States of America, June 16-22, 2024
Note

Part of ISBN 9798350365474

QC 20250715

Available from: 2025-07-15 Created: 2025-07-15 Last updated: 2025-08-13 Bibliographically approved
7. Should you use a probabilistic duration model in TTS? Probably! Especially for spontaneous speech
2024 (English). In: Interspeech 2024, International Speech Communication Association, 2024, p. 2285-2289. Conference paper, Published paper (Refereed)
Abstract [en]

Converting input symbols to output audio in TTS requires modelling the durations of speech sounds. Leading non-autoregressive (NAR) TTS models treat duration modelling as a regression problem. The same utterance is then spoken with identical timings every time, unlike when a human speaks. Probabilistic models of duration have been proposed, but there is mixed evidence of their benefits. However, prior studies generally only consider speech read aloud, and ignore spontaneous speech, despite the latter being both a more common and a more variable mode of speaking. We compare the effect of conventional deterministic duration modelling to durations sampled from a powerful probabilistic model based on conditional flow matching (OT-CFM), in three different NAR TTS approaches: regression-based, deep generative, and end-to-end. Across four different corpora, stochastic duration modelling improves probabilistic NAR TTS approaches, especially for spontaneous speech.
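The practical difference is easy to demonstrate. In the toy Python sketch below (ours; a log-normal draw stands in for the paper's flow-matching duration sampler), regression returns identical phone durations on every call, while the probabilistic model yields a new plausible timing each run, and the gap matters most when the spread is large, as in spontaneous speech:

    import numpy as np

    rng = np.random.default_rng(0)
    mean_log_dur = np.array([2.1, 1.4, 2.8])  # per-phone mean log-durations (frames)
    std_log_dur = np.array([0.3, 0.5, 0.4])   # wide spread, as in spontaneous speech

    # Regression (deterministic): the same durations for every synthesis of the text.
    deterministic = np.exp(mean_log_dur)

    # Probabilistic: each synthesis draws a different, internally plausible timing.
    sampled = np.exp(rng.normal(mean_log_dur, std_log_dur))

    print(np.round(deterministic, 1))  # always [ 8.2  4.1 16.4]
    print(np.round(sampled, 1))        # differs from run to run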

Place, publisher, year, edition, pages
International Speech Communication Association, 2024
Keywords
conditional flow matching, duration modelling, probabilistic models, Speech synthesis, spontaneous speech
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-358878 (URN)
10.21437/Interspeech.2024-1582 (DOI)
001331850102086 (ISI)
2-s2.0-85214793947 (Scopus ID)
Conference
Interspeech 2024, 25th Interspeech Conference, Kos Island, Greece, September 1-5, 2024
Note

QC 20250127

Available from: 2025-01-23 Created: 2025-01-23 Last updated: 2025-12-05 Bibliographically approved
8. Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
2025 (English). In: ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2025, p. 1-5. Conference paper, Published paper (Refereed)
Abstract [en]

Integrating audio comprehension and generation into large language models (LLMs) remains challenging due to the continuous nature of audio and the resulting high sampling rates. Here, we introduce a novel approach that combines Variational Quantization with Conditional Flow Matching to convert audio into ultra-low-bitrate discrete tokens at 0.23 kbps, allowing for seamless integration with text tokens in LLMs. We fine-tuned a pretrained text-based LLM using Low-Rank Adaptation (LoRA) to assess its effectiveness in achieving true multimodal capabilities, i.e., audio comprehension and generation. Our tokenizer outperforms a traditional VQ-VAE across various datasets with diverse acoustic events. Despite the substantial loss of fine-grained detail through audio tokenization, our multimodal LLM trained with discrete tokens achieves results in audio comprehension that are competitive with state-of-the-art methods, though audio generation is poor. Our results highlight the need for larger, more diverse datasets and improved evaluation metrics to advance multimodal LLM performance.
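To put the quoted figure in perspective: a discrete tokenizer's bitrate is the token rate times the bits per token, so an ultra-low bitrate directly bounds how many audio tokens compete with text for LLM context. The codebook size below is an assumed illustration, not the paper's configuration:

    % Bitrate of a discrete audio tokenizer:
    \mathrm{bitrate} = r_{\mathrm{tok}} \cdot \log_2 |\mathcal{C}|
    % Worked example: 0.23 kbps = 230 bit/s; with an assumed 4096-entry codebook
    % (\log_2 4096 = 12 bits per token), that is 230 / 12 \approx 19 tokens per
    % second, i.e. sequences short enough to interleave with text tokens in an LLM.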

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2025
Keywords
audio language models, multimodal LLMs, audio reasoning, audio captioning, audio tokenization, audio generation
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-368341 (URN)
10.1109/ICASSP49660.2025.10888809 (DOI)
2-s2.0-105003881005 (Scopus ID)
979-8-3503-6874-1 (ISBN)
Conference
2025 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2025, Hyderabad, India, April 6-11, 2025
Note

Part of ISBN 979-8-3503-6874-1, 979-8-3503-6875-8

QC 20250813

Available from: 2025-08-13 Created: 2025-08-13 Last updated: 2025-08-13 Bibliographically approved

Open Access in DiVA

Kappa (fulltext): FULLTEXT01.pdf, 3248 kB, application/pdf
