It is not trivial to know which sounds a simulated vocal model can produce, nor how those sounds relate to its articulatory behavior. Mapping this out is useful for applications that exploit the extended capabilities of the voice, such as singing or vocal imitation. We present a method that achieves this for a state-of-the-art articulatory vocal model (VocalTractLab) by combining it with a recent Quality-Diversity algorithm (CMA-MAE) and audio embeddings obtained from a multi-modal pretrained model (CLAP). The text capabilities of CLAP make it possible to steer the exploration with a text prompt. We show that the method explores more efficiently than a random sampling baseline, covering more of the measure space and achieving higher objective scores. We provide several listening examples and the source code for a scalable implementation.
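The pipeline described above can be read as a standard Quality-Diversity loop: CMA-MAE proposes articulatory parameter vectors, the vocal model synthesizes audio, and CLAP scores that audio against the steering prompt while also supplying the measure space. The sketch below is a hedged illustration of that loop, not the released implementation: it assumes the pyribs implementation of CMA-MAE and the laion_clap package, uses a hypothetical `synthesize` wrapper in place of the actual VocalTractLab call, picks an arbitrary 2-D slice of the CLAP audio embedding as the measure space, and the prompt text is only an example.

```python
import numpy as np
import laion_clap
from ribs.archives import GridArchive
from ribs.emitters import EvolutionStrategyEmitter
from ribs.schedulers import Scheduler

N_PARAMS = 20        # assumed dimensionality of the articulatory parameter vector
SAMPLE_RATE = 48000  # laion_clap expects 48 kHz audio

# Load a pretrained CLAP model and embed the steering prompt once.
clap = laion_clap.CLAP_Module(enable_fusion=False)
clap.load_ckpt()  # downloads a default checkpoint
text_emb = clap.get_text_embedding(["a person singing a high-pitched vowel"])[0]

def synthesize(params: np.ndarray) -> np.ndarray:
    """Hypothetical wrapper around the articulatory model (e.g. VocalTractLab):
    map a parameter vector to mono audio at SAMPLE_RATE."""
    raise NotImplementedError

def evaluate(params: np.ndarray):
    """Return (objective, measures) for one candidate parameter vector."""
    audio = synthesize(params)
    audio_emb = clap.get_audio_embedding_from_data(audio[None, :], use_tensor=False)[0]
    # Objective: cosine similarity between the sound and the text prompt.
    objective = float(
        np.dot(audio_emb, text_emb)
        / (np.linalg.norm(audio_emb) * np.linalg.norm(text_emb))
    )
    # Measures: a 2-D projection of the audio embedding (placeholder choice).
    measures = audio_emb[:2]
    return objective, measures

# CMA-MAE: a soft-threshold archive (learning_rate < 1) plus a separate
# result archive that keeps the best solution found in each cell.
archive = GridArchive(solution_dim=N_PARAMS, dims=[50, 50],
                      ranges=[(-1, 1), (-1, 1)],
                      learning_rate=0.01, threshold_min=-1.0)
result_archive = GridArchive(solution_dim=N_PARAMS, dims=[50, 50],
                             ranges=[(-1, 1), (-1, 1)])
emitters = [EvolutionStrategyEmitter(archive, x0=np.zeros(N_PARAMS), sigma0=0.1,
                                     ranker="imp", batch_size=32)
            for _ in range(4)]
scheduler = Scheduler(archive, emitters, result_archive=result_archive)

for _ in range(1000):
    solutions = scheduler.ask()
    objectives, measures = zip(*(evaluate(s) for s in solutions))
    scheduler.tell(np.array(objectives), np.array(measures))
```

In this reading, changing the text prompt changes only `text_emb`, so the same archive machinery can be steered toward different vocal behaviors; the per-candidate synthesis calls are independent, which is what makes a scalable, parallel implementation straightforward.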