Exploring the Expressive Space of an Articulatory Vocal Model using Quality-Diversity Optimization with Multimodal Embeddings
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0009-0009-9955-7143
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0009-0003-8553-3542
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Speech, Music and Hearing, TMH. ORCID iD: 0000-0003-2549-6367
2025 (English). In: GECCO 2025 - Proceedings of the 2025 Genetic and Evolutionary Computation Conference, Association for Computing Machinery (ACM), 2025, p. 1362-1370. Conference paper, published paper (refereed)
Abstract [en]

Knowing which sounds a simulated vocal model can produce, and how they relate to its articulatory behavior, is not trivial. Mapping this out is valuable for applications that exploit the extended capabilities of a voice, e.g., singing or vocal imitation. We present a method that achieves this for a state-of-the-art articulatory vocal model (VocalTractLab) by combining it with a recent Quality-Diversity algorithm (CMA-MAE) and audio embeddings obtained from a multimodal pretrained model (CLAP). The text capabilities of CLAP make it possible to steer the exploration through a text prompt. We show that the method explores more efficiently than a random-sampling baseline, covering more of the measure space and achieving higher objective scores. We provide several listening examples and the source code for a scalable implementation.
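The pipeline the abstract describes (sample articulatory parameters, synthesize audio, embed it, and keep the best solution per cell of a measure-space archive) can be illustrated with a simplified Quality-Diversity loop. This is a hedged sketch, not the authors' implementation: the toy `synthesize`, `embed`, and `objective` functions stand in for VocalTractLab, a low-dimensional projection of CLAP embeddings, and similarity to a text-prompt embedding, respectively, and plain MAP-Elites replaces CMA-MAE.

```python
import random

GRID = 10  # cells per measure dimension

def synthesize(params):
    # Stand-in for VocalTractLab: the "audio" is just a transform of the params.
    return [p * p for p in params]

def embed(audio):
    # Stand-in for a 2-D projection of a CLAP audio embedding (the measure space).
    # Both components fall in [0, 1] for params in [0, 1].
    return (sum(audio) / len(audio), max(audio) - min(audio))

def objective(audio):
    # Stand-in for similarity to a CLAP text-prompt embedding (higher is better).
    return -abs(sum(audio) - 1.0)

def cell(measures):
    # Discretize measures (assumed in [0, 1]) into an archive cell index.
    return tuple(min(GRID - 1, int(m * GRID)) for m in measures)

def map_elites(n_iters=2000, dim=4, seed=0):
    rng = random.Random(seed)
    archive = {}  # cell index -> (score, params) of the elite in that cell
    for _ in range(n_iters):
        if archive and rng.random() < 0.5:
            # Mutate an existing elite, clamping params back into [0, 1].
            _, parent = rng.choice(list(archive.values()))
            params = [min(1.0, max(0.0, p + rng.gauss(0, 0.1))) for p in parent]
        else:
            # Otherwise sample fresh articulatory parameters uniformly.
            params = [rng.random() for _ in range(dim)]
        audio = synthesize(params)
        score, c = objective(audio), cell(embed(audio))
        # Keep the solution only if its cell is empty or it beats the incumbent.
        if c not in archive or score > archive[c][0]:
            archive[c] = (score, params)
    return archive
```

The number of filled cells, `len(map_elites())`, corresponds to the measure-space coverage the abstract compares against the random-sampling baseline; in the paper this role is played by the CMA-MAE archive over CLAP-derived measures.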

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025, p. 1362-1370
Keywords [en]
articulatory vocal model, CLAP, CMA-MAE, diversity optimization, multimodal, quality-diversity, text prompt, VocalTractLab
National Category
Natural Language Processing; Computer Sciences; Comparative Language Studies and Linguistics; Signal Processing
Identifiers
URN: urn:nbn:se:kth:diva-369365
DOI: 10.1145/3712256.3726313
ISI: 001556459900153
Scopus ID: 2-s2.0-105013082602
OAI: oai:DiVA.org:kth-369365
DiVA id: diva2:1994811
Conference
2025 Genetic and Evolutionary Computation Conference (GECCO 2025), Malaga, Spain, July 14-18, 2025
Note

Part of ISBN 9798400714658

QC 20250903

Available from: 2025-09-03. Created: 2025-09-03. Last updated: 2025-12-05. Bibliographically approved.

Open Access in DiVA

No full text in DiVA

Other links

Publisher's full text
Scopus

Authority records

Grouwels, Joris; Jonason, Nicolas; Sturm, Bob

Search in DiVA

By author/editor
Grouwels, Joris; Jonason, Nicolas; Sturm, Bob
By organisation
Speech, Music and Hearing, TMH
By national category
Natural Language Processing; Computer Sciences; Comparative Language Studies and Linguistics; Signal Processing
