Publications (10 of 13)
Jonason, N. (2026). Bittersweet Lessons in Music AI Research: Neural Instrument Synthesis, Multi-modal Representations, Symbolic Music Generation. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
2026 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This compilation thesis explores AI techniques in three areas related to music making: neural instrument synthesis, multi-modal representations, and symbolic music generation. In neural instrument synthesis, we explore architectural changes and transfer learning techniques to apply neural synthesis methods to instruments for which little data is available. We then move to zero-shot audio applications of multi-modal representations, including text-guided audio equalization, visualization of instrument sounds, and text-driven synthesizer programming. In the domain of symbolic music, we propose superposed language modelling, a generalisation of masked language modelling that enables controllable generation and editing of music using event-attribute domain constraints. We then experiment with text-driven music generation and editing with LLMs augmented with a retrieval system to fetch relevant few-shot examples, showing early signs that LLMs could challenge domain-specific approaches to symbolic music generation. We then bridge the symbolic and audio domains by using an audio-domain model of human preferences as a reward to tune a symbolic music generation model, producing music which, according to the preference model, is better than Mozart. Reflecting on our work, we identify data availability as the key factor in determining Music AI capabilities and argue that much of our work in this thesis can be seen as capability arbitrage: redirecting capabilities from data-rich domains towards data-poor domains. We conclude by speculating on the music making capabilities of future AI, considering the massive iceberg of data that remains unused.

Abstract [sv] (English translation)

This thesis explores AI techniques in three areas related to music making: neural instrument synthesis, multimodal representations, and symbolic music generation. In neural instrument synthesis, we explore architectural changes and transfer learning to apply neural synthesis methods to instruments for which little data is available. We then move on to zero-shot audio applications of multimodal representations, including text-guided audio equalization, visualization of instrument sounds, and text-driven synthesizer programming. In symbolic music, we propose superposed language models, a generalisation of masked language models for controllable generation and editing of music with event-attribute domain constraints. We then experiment with text-driven music generation and editing using LLMs augmented with a retrieval system that fetches relevant few-shot examples, an early sign that LLMs can challenge domain-specific methods for symbolic music generation. We then bridge the symbolic and audio domains by using an audio-domain model of human preferences as a reward signal to finetune a symbolic music generation model, producing music that, according to the preference model, is better than Mozart. Reflecting on our work, we highlight data availability as the decisive factor for the capabilities of music AI, and argue that much of our work can be seen as capability arbitrage: a redirection of capabilities from data-rich domains towards data-poor domains. We conclude by speculating about future AI capabilities for music making, given the massive iceberg of data that remains unused.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2026. p. xvi, 67
Series
TRITA-EECS-AVL ; 2026:19
Keywords
Artificial Intelligence, Machine Learning, Music
National Category
Artificial Intelligence
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-377801 (URN) 978-91-8106-542-8 (ISBN)
Public defence
2026-03-27, https://kth-se.zoom.us/j/64932870406, F3 (Flodis), Lindstedtsvägen 26 & 28, Stockholm, Sweden, 15:00 (English)
Funder
EU, Horizon 2020, 864189
Note

QC 20260306

Available from: 2026-03-06 Created: 2026-03-05 Last updated: 2026-03-30 Bibliographically approved
Grouwels, J., Jonason, N. & Sturm, B. (2025). Exploring the Expressive Space of an Articulatory Vocal Model using Quality-Diversity Optimization with Multimodal Embeddings. In: GECCO 2025 - Proceedings of the 2025 Genetic and Evolutionary Computation Conference. Paper presented at 2025 Genetic and Evolutionary Computation Conference, GECCO 2025, Malaga, Spain, Jul 14 2025 - Jul 18 2025 (pp. 1362-1370). Association for Computing Machinery (ACM)
2025 (English) In: GECCO 2025 - Proceedings of the 2025 Genetic and Evolutionary Computation Conference, Association for Computing Machinery (ACM), 2025, p. 1362-1370. Conference paper, Published paper (Refereed)
Abstract [en]

Knowing which sounds can be produced by a simulated vocal model and how they are connected to its articulatory behavior is not trivial. Being able to map this out can be interesting for applications that make use of the extended capabilities of a voice, e.g., singing or vocal imitations. We present a method that achieves this for a state-of-the-art articulatory vocal model (VocalTractLab) by combining it with a recent Quality-Diversity algorithm (CMA-MAE) and audio embeddings obtained through a multi-modal pretrained model (CLAP). The text-capabilities of CLAP make it possible to steer the exploration through a text prompt. We show that the method explores more efficiently than a random sampling baseline, covering more of the measure space and achieving higher objective scores. We provide several listening examples and the source code for a scalable implementation.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2025
Keywords
articulatory vocal model, CLAP, CMA-MAE, diversity optimization, multimodal, quality-diversity, text prompt, VocalTractLab
National Category
Natural Language Processing Computer Sciences Comparative Language Studies and Linguistics Signal Processing
Identifiers
urn:nbn:se:kth:diva-369365 (URN) 10.1145/3712256.3726313 (DOI) 001556459900153 () 2-s2.0-105013082602 (Scopus ID)
Conference
2025 Genetic and Evolutionary Computation Conference, GECCO 2025, Malaga, Spain, Jul 14 2025 - Jul 18 2025
Note

Part of ISBN 9798400714658

QC 20250903

Available from: 2025-09-03 Created: 2025-09-03 Last updated: 2025-12-05 Bibliographically approved
Jonason, N., Casini, L. & Sturm, B. (2025). SMART: Tuning a symbolic music generation system with an audio domain aesthetic reward. In: Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025). Paper presented at 6th Conference on AI Music Creativity (AIMC 2025), Brussels, Belgium, September 10-12, 2025.
2025 (English) In: Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025), 2025. Conference paper, Published paper (Refereed)
Abstract [en]

Recent work has proposed training machine learning models to predict aesthetic ratings for music audio. Our work explores whether such models can be used to finetune a symbolic music generation system with reinforcement learning, and what effect this has on the system outputs. To test this, we use group relative policy optimization to finetune a piano MIDI model with Meta Audiobox Aesthetics ratings of audio-rendered outputs as the reward. We find that this optimization affects multiple low-level features of the generated outputs, and improves the average subjective ratings in a preliminary listening study with 14 participants. We also find that over-optimization dramatically reduces diversity of model outputs. Code and listening examples can be found here: https://github.com/erl-j/SMART.
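The "group relative" part of group relative policy optimization can be illustrated with a minimal sketch. It is not taken from the SMART repository: the function name, the example scores, and the use of the population standard deviation are assumptions made for illustration. The idea is that several outputs are sampled for the same prompt, each is scored by the reward model, and each reward is normalised against the statistics of its own group.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of samples drawn for the same prompt.

    Assumption: population standard deviation; implementations differ on this.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every sample receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical aesthetic scores for four audio-rendered outputs of one prompt.
scores = [6.0, 7.0, 5.0, 6.0]
advantages = group_relative_advantages(scores)
```

Outputs scored above the group mean receive a positive advantage and are reinforced, while those below the mean are discouraged, so no separate value network is needed.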

Keywords
Artificial Intelligence, Music, Reinforcement Learning
National Category
Artificial Intelligence
Identifiers
urn:nbn:se:kth:diva-377799 (URN) 10.5281/zenodo.16946387 (DOI)
Conference
6th Conference on AI Music Creativity (AIMC 2025), Brussels, Belgium, September 10-12, 2025
Funder
EU, Horizon 2020, 864189
Note

QC 20260306

Available from: 2026-03-05 Created: 2026-03-05 Last updated: 2026-03-06 Bibliographically approved
Thomé, C., Sturm, B., Pertoft, J. & Jonason, N. (2024). Applying textual inversion to control and personalize text-to-music models. In: Proc. 15th Int. Workshop on Machine Learning and Music. Paper presented at Int. Workshop on Machine Learning and Music.
2024 (English) In: Proc. 15th Int. Workshop on Machine Learning and Music, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

A text-to-music (TTM) model should synthesize audio that reflects the concepts in a given prompt as long as it has been trained on those concepts. If a prompt references concepts that the TTM model has not been trained on, the audio it synthesizes will likely not match the prompt. This paper investigates the application of a simple gradient-based approach called textual inversion (TI) to expand the concept vocabulary of a trained TTM model without compromising the fidelity of concepts on which it has already been trained. We apply this technique to MusicGen and measure its reconstruction and editability quality, as well as its subjective quality. We see that TI can expand the concept vocabulary of a pretrained TTM model, thus making it personalized and more controllable without having to finetune the entire model.

National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Identifiers
urn:nbn:se:kth:diva-356224 (URN)
Conference
Int. Workshop on Machine Learning and Music
Funder
EU, Horizon 2020, 864189
Note

QC 20241113

Available from: 2024-11-12 Created: 2024-11-12 Last updated: 2024-11-13 Bibliographically approved
Jonason, N., Wang, X., Cooper, E., Juvela, L., Sturm, B. & Yamagishi, J. (2024). DDSP-based Neural Waveform Synthesis of Polyphonic Guitar Performance from String-wise MIDI Input. In: Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24). Paper presented at Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), Guildford, United Kingdom, 3 - 7 September, 2024.
2024 (English) In: Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), 2024. Conference paper, Published paper (Refereed)
Abstract [en]

We explore the use of neural synthesis for acoustic guitar from string-wise MIDI input. We propose four different systems and compare them with both objective metrics and subjective evaluation against natural audio and a sample-based baseline. We iteratively develop these four systems by making various considerations on the architecture and intermediate tasks, such as predicting pitch and loudness control features. We find that formulating the control feature prediction task as a classification task rather than a regression task yields better results. Furthermore, we find that our simplest proposed system, which directly predicts synthesis parameters from MIDI input, performs the best out of the four proposed systems. Audio examples and code are available.

National Category
Computer and Information Sciences
Identifiers
urn:nbn:se:kth:diva-350217 (URN) 2-s2.0-85210235985 (Scopus ID)
Conference
Proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), Guildford, United Kingdom, 3 - 7 September, 2024
Funder
EU, Horizon 2020, 864189
Note

QC 20241205

Available from: 2024-07-08 Created: 2024-07-08 Last updated: 2026-03-05 Bibliographically approved
Casini, L., Jonason, N. & Sturm, B. (2024). Investigating the Viability of Masked Language Modeling for Symbolic Music Generation in abc-notation. In: Johnson, C., Rebelo, S. M. & Santos, I. (Ed.), ARTIFICIAL INTELLIGENCE IN MUSIC, SOUND, ART AND DESIGN, EVOMUSART 2024. Paper presented at 13th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART) Held as Part of EvoStar Conference, APR 03-05, 2024, Aberystwyth, WALES (pp. 84-96). Springer Nature, 14633
2024 (English) In: ARTIFICIAL INTELLIGENCE IN MUSIC, SOUND, ART AND DESIGN, EVOMUSART 2024 / [ed] Johnson, C., Rebelo, S. M. & Santos, I., Springer Nature, 2024, Vol. 14633, p. 84-96. Conference paper, Published paper (Refereed)
Abstract [en]

The dominant approach for modeling sequences (e.g. text, music) with deep learning is the causal approach, which consists of learning to predict tokens sequentially, each given the tokens preceding it. Another paradigm is masked language modeling, which consists of learning to predict the masked tokens of a sequence in no specific order, given all non-masked tokens. Both approaches can be used for generation, but the latter is more flexible for editing, e.g. changing the middle of a sequence. This paper investigates the viability of masked language modeling applied to Irish traditional music represented in the text-based format abc-notation. Our model, called abcMLM, enables a user to edit tunes in arbitrary ways while retaining similar generation capabilities to causal models. We find that generation using masked language modeling is more challenging, but leveraging additional information from a dataset, e.g., imputing musical structure, can generate sequences that are on par with previous models.
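The difference between the two training paradigms described in this abstract can be sketched by how training examples are built from one token sequence. This is an illustrative sketch only, not the abcMLM code: the token list, the `<mask>` symbol, the mask ratio, and both function names are hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def causal_examples(tokens):
    """Causal modelling: each target is predicted from the tokens before it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_example(tokens, mask_ratio, rng):
    """Masked modelling: hide a random subset of positions, predict them
    from all remaining tokens on both sides."""
    n_masked = max(1, round(mask_ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    corrupted = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}
    return corrupted, targets


# Hypothetical abc-like tokens for a short tune fragment.
tune = ["K:D", "|:", "d2", "fd", "Ad", "fd", ":|"]
pairs = causal_examples(tune)
corrupted, targets = masked_example(tune, mask_ratio=0.3, rng=random.Random(0))
```

Because the masked positions can be anywhere, the same trained model can in principle fill in the middle of a tune at inference time, which is the editing flexibility the abstract refers to.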

Place, publisher, year, edition, pages
Springer Nature, 2024
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 14633
Keywords
Symbolic Music Generation, Masked Language Models, Irish Traditional Music
National Category
Natural Language Processing
Identifiers
urn:nbn:se:kth:diva-347151 (URN) 10.1007/978-3-031-56992-0_6 (DOI) 001212363900006 () 2-s2.0-85190687279 (Scopus ID)
Conference
13th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART) Held as Part of EvoStar Conference, APR 03-05, 2024, Aberystwyth, WALES
Note

QC 20240604

Part of ISBN 978-3-031-56991-3; 978-3-031-56992-0

Available from: 2024-06-04 Created: 2024-06-04 Last updated: 2025-02-07 Bibliographically approved
Casini, L., Jonason, N. & Sturm, B. (2024). Sparks of Musical AGI? Challenges and perspectives in music co-creation with LLMs: A qualitative exploration of the music knowledge of LLMs and their use for music creation. Paper presented at International Conference on AI and Musical Creativity (AIMC) 2024, Oxford UK, 9 - 11 September 2024.
2024 (English) Conference paper, Published paper (Refereed)
Abstract [en]

In the paper Sparks of Artificial General Intelligence, the authors show how OpenAI’s GPT-4 is able to do well in a variety of tasks that can be represented with text and claim it to have “a more general intelligence than previous AI models.” One of the tasks they explore is symbolic music generation. In this paper we critically analyze their results and extend the discourse around the capabilities of LLMs for music by exploring additional musical tasks and LLMs. Furthermore, we investigate the viability of smaller models when used in conjunction with Retrieval Augmented Generation, as well as finetuning on programmatically written prompts using Quantized Low Rank Adapters. Finally, we discuss some critical aspects of LLMs as a tool for music generation.

Keywords
Large Language Models, Music Co-Creation, Music Understanding, Finetuning
National Category
Computer and Information Sciences Music
Identifiers
urn:nbn:se:kth:diva-352705 (URN)
Conference
International Conference on AI and Musical Creativity (AIMC) 2024, Oxford UK, 9 - 11 September 2024
Note

QC 20240906

Available from: 2024-09-05 Created: 2024-09-05 Last updated: 2025-02-21 Bibliographically approved
Jonason, N., Casini, L. & Sturm, B. (2024). Steer-by-prior Editing of Symbolic Music Loops. Paper presented at International Workshop on Machine Learning and Music, September 9, 2024 Vilnius, Lithuania.
2024 (English) Conference paper, Published paper (Refereed)
Abstract [en]

With the goal of building a system capable of controllable symbolic music loop generation and editing, this paper explores a generalisation of Masked Language Modelling we call Superposed Language Modelling. Rather than input tokens being known or unknown, a Superposed Language Model takes priors over the sequence as input, enabling us to apply various constraints to the generation at inference time. After detailing our approach, we demonstrate our model across various editing tasks in the domain of multi-instrument MIDI loops. We end by highlighting some limitations of the approach and avenues for future work.
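One way to picture "priors over the sequence as input" is that each position carries a distribution over the vocabulary rather than a single token or a mask symbol. The sketch below is an assumption made for illustration and not the paper's implementation: the toy vocabulary, the `prior` helper, and the choice of a uniform distribution over the allowed tokens are all hypothetical.

```python
# Hypothetical toy vocabulary of pitch tokens.
VOCAB = ["pitch:60", "pitch:62", "pitch:64", "pitch:67"]


def prior(allowed):
    """Return a uniform prior over the allowed tokens and zero elsewhere."""
    allowed = set(allowed)
    if not allowed or not allowed <= set(VOCAB):
        raise ValueError("allowed must be a non-empty subset of VOCAB")
    weight = 1.0 / len(allowed)
    return [weight if tok in allowed else 0.0 for tok in VOCAB]


known = prior(["pitch:60"])                    # one-hot: an ordinary known token
unknown = prior(VOCAB)                         # uniform: an ordinary masked token
constrained = prior(["pitch:64", "pitch:67"])  # in between: a constraint
```

Under this reading, masked language modelling is the special case where every position is either one-hot or uniform, and a constraint such as "only these pitches" is expressed by restricting the support of the prior at inference time.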

Keywords
machine learning, music, masked language models, superposed language models, constraints, MIDI
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-363494 (URN)
Conference
International Workshop on Machine Learning and Music, September 9, 2024 Vilnius, Lithuania
Projects
EU, Horizon 2020, 864189
Funder
EU, European Research Council, 864189
Note

QC 20250520

Available from: 2025-05-16 Created: 2025-05-16 Last updated: 2026-03-05 Bibliographically approved
Jonason, N., Casini, L., Thomé, C. & Sturm, B. (2023). Retrieval Augmented Generation of Symbolic Music with LLMs. In: Extended Abstracts for the Late-Breaking Demo Session of the 22nd Int. Society for Music Information Retrieval Conf. Paper presented at 22nd International Society for Music Information Retrieval Conference, Online, November 7-12, 2021.
2023 (English) In: Extended Abstracts for the Late-Breaking Demo Session of the 22nd Int. Society for Music Information Retrieval Conf., 2023. Conference paper, Poster (with or without abstract) (Other academic)
Abstract [en]

We explore the use of large language models (LLMs) for music generation using a retrieval system to select relevant few-shot examples. We find promising initial results for music generation in a dialogue with the user, especially considering the ease with which such a system can be implemented. The code is available online.
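The retrieval step described in this abstract can be sketched as: score stored examples against the user's request, keep the top few, and place them in the prompt as few-shot examples. The sketch below is an assumption for illustration, not the authors' code. It uses a simple bag-of-words cosine similarity in place of whatever retrieval system the paper uses, and the corpus, field names, and prompt layout are hypothetical.

```python
from collections import Counter
from math import sqrt


def bow(text):
    """Bag-of-words vector as a token counter."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k):
    """Return the k corpus entries whose descriptions best match the query."""
    q = bow(query)
    ranked = sorted(corpus, key=lambda ex: cosine(q, bow(ex["description"])), reverse=True)
    return ranked[:k]


def build_prompt(query, examples):
    """Lay out retrieved examples as few-shot demonstrations before the request."""
    shots = "\n\n".join(f"Request: {ex['description']}\nTune:\n{ex['abc']}" for ex in examples)
    return f"{shots}\n\nRequest: {query}\nTune:\n"


# Hypothetical corpus of described abc-notation snippets.
corpus = [
    {"description": "slow waltz in D major", "abc": "X:1\nM:3/4\nK:D\nD2 F2 A2|"},
    {"description": "fast Irish jig in G major", "abc": "X:2\nM:6/8\nK:G\nGAB dBG|"},
    {"description": "blues riff in E minor", "abc": "X:3\nM:4/4\nK:Em\nE2 G2 A2 B2|"},
]
query = "a lively jig in G major"
examples = retrieve(query, corpus, k=2)
prompt = build_prompt(query, examples)
```

The resulting `prompt` string would then be sent to an LLM of choice; that call is left out here because it depends on the provider.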

Keywords
Music, Artificial Intelligence, Symbolic music generation, Abc notation, Large Language Models
National Category
Artificial Intelligence
Identifiers
urn:nbn:se:kth:diva-377800 (URN)
Conference
22nd International Society for Music Information Retrieval Conference, Online, November 7-12, 2021
Funder
EU, Horizon 2020, 864189
Note

QC 20260306

Available from: 2026-03-05 Created: 2026-03-05 Last updated: 2026-03-06 Bibliographically approved
Jonason, N. & Sturm, B. (2022). Audio Latent Space Cartography. In: Extended Abstracts for the Late-Breaking Demo Session of the 23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022. Paper presented at 23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022.
2022 (English) In: Extended Abstracts for the Late-Breaking Demo Session of the 23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022. Conference paper, Poster (with or without abstract) (Other academic)
Abstract [en]

We explore the generation of visualisations of audio latent spaces using an audio-to-image generation pipeline. We believe this can help with the interpretability of audio latent spaces. We demonstrate a variety of results on the NSynth dataset. A web demo is available.

Keywords
AI, music, multimodality, visualisation
National Category
Artificial Intelligence
Identifiers
urn:nbn:se:kth:diva-377797 (URN)
Conference
23rd Int. Society for Music Information Retrieval Conf., Bengaluru, India, 2022
Funder
EU, Horizon 2020, 864189
Note

QC 20260306

Available from: 2026-03-05 Created: 2026-03-05 Last updated: 2026-03-06 Bibliographically approved
Identifiers
ORCID iD: orcid.org/0009-0003-8553-3542