Modeling Music: Studies of Music Transcription, Music Perception and Music Production
KTH, School of Electrical Engineering and Computer Science (EECS), Speech, Music and Hearing, TMH, Music Acoustics. ORCID iD: 0000-0002-4957-2128
2018 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This dissertation presents ten studies focusing on three important subfields of music information retrieval (MIR): music transcription (Part A), music perception (Part B), and music production (Part C).

In Part A, systems capable of transcribing rhythm and polyphonic pitch are described. The first two publications present methods for tempo estimation and beat tracking. A method is developed for computing the most salient periodicity (the “cepstroid”), and the computed cepstroid is used to guide the machine learning processing. The polyphonic pitch tracking system uses novel pitch-invariant and tone-shift-invariant processing techniques. Furthermore, the neural flux is introduced – a latent feature for onset and offset detection. The transcription systems use a layered learning technique with separate intermediate networks of varying depth. Important music concepts are used as intermediate targets to create a processing chain with high generalization. State-of-the-art performance is reported for all tasks.

Part B is devoted to perceptual features of music, which can be used as intermediate targets or as parameters for exploring fundamental music perception mechanisms. Systems are proposed that can predict the perceived speed and performed dynamics of an audio file with high accuracy, using the average ratings from around 20 listeners as ground truths.

In Part C, aspects related to music production are explored. The first paper analyzes the long-term average spectrum (LTAS) of popular music. A compact equation is derived to describe the mean LTAS of a large dataset, and the variation is visualized. Further analysis shows that the level of the percussion is an important factor for the LTAS. The second paper examines songwriting and composition through the development of an algorithmic composer of popular music. Various factors relevant for writing good compositions are encoded, and a listening test is employed that shows the validity of the proposed methods.

The dissertation is concluded by Part D (Looking Back and Ahead), which serves as a discussion and provides a roadmap for future work. The first paper discusses the deep layered learning (DLL) technique, outlining concepts and pointing out a direction for future MIR implementations. It is suggested that DLL can help generalization by enforcing the validity of intermediate representations, and by letting the inferred representations establish disentangled structures that support high-level invariant processing. The second paper proposes an architecture for tempo-invariant processing of rhythm with convolutional neural networks. Log-frequency representations of rhythm-related activations are suggested for the main stage of processing. Methods relying on magnitude, relative phase, and raw phase information are described for a wide variety of rhythm processing tasks.

Abstract [sv]

This dissertation presents ten studies within three important subfields of the research field Music Information Retrieval (MIR) – a field focused on extracting information from music. Part A addresses music transcription, Part B music perception, and Part C music production. A concluding part (Part D) discusses the machine learning methodology and looks ahead.

Part A presents systems that can transcribe music with respect to rhythm and polyphonic pitch. The first two publications describe methods for estimating the tempo and the beat positions of audio recordings. A method for computing the most salient periodicity (the “cepstroid”) is described, together with how it can be used to guide the applied machine learning systems. The system for polyphonic pitch estimation can identify both sounding tones and note onsets and offsets. The system is pitch-invariant as well as invariant to variations over time within sounding tones. The transcription systems are trained to predict several musical aspects in a hierarchical structure. The transcription results are the best reported in tests on several different datasets.

Part B focuses on perceptual features of music. These can be predicted in order to model fundamental aspects of perception, but they can also be used as representations in models that attempt to classify overarching musical parameters. Models are presented that can predict the perceived speed and the perceived performed dynamics with high precision. Averaged ratings from around 20 listeners serve as target values during training and evaluation.

Part C explores aspects related to music production. The first study analyzes variations in the long-term average spectrum across popular-music tracks. The analysis shows that the level of percussive instruments is an important factor for the spectral distribution – the data suggest that this level is a better predictor of the spectrum than genre labels. The second study in Part C addresses music composition. An algorithmic composition program is presented, in which relevant musical parameters are combined in a hierarchical structure. A listening test is conducted to demonstrate the validity of the program and to examine the effect of certain parameters.

The dissertation concludes with Part D, which places the developed machine learning technique in a broader context and proposes new methods for generalizing rhythm prediction. The first study discusses deep learning systems that predict different musical aspects in a hierarchical structure. Relevant concepts are presented together with suggestions for future implementations. The second study proposes a tempo-invariant method for processing the log-frequency domain of rhythm signals with convolutional neural networks. The proposed architecture can make use of magnitude, relative phase between rhythm channels, and the original phase from the frequency transform to address several important rhythm-related problems.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2018, p. 49
Series
TRITA-EECS-AVL ; 2018-35
Keywords [en]
Music Information Retrieval, MIR, Music, Music Transcription, Music Perception, Music Production, Tempo Estimation, Beat Tracking, Polyphonic Pitch Tracking, Polyphonic Transcription, Music Speed, Music Dynamics, Long-term average spectrum, LTAS, Algorithmic Composition, Deep Layered Learning, Convolutional Neural Networks, Rhythm Tracking, Ensemble Learning, Perceptual Features, Representation Learning
National Category
Other Computer and Information Science; Computer Engineering; Media and Communication Technology
Identifiers
URN: urn:nbn:se:kth:diva-226894. ISBN: 978-91-7729-768-0 (print). OAI: oai:DiVA.org:kth-226894. DiVA id: diva2:1201904
Public defence
2018-05-18, D3, Kungliga Tekniska Högskolan, Lindstedtsvägen 5, Stockholm, 13:00 (English)
Opponent
Supervisors
Note

QC 20180427

Available from: 2018-04-27. Created: 2018-04-26. Last updated: 2018-05-03. Bibliographically approved.
List of papers
1. Modeling the perception of tempo
2015 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 137, no. 6, p. 3163-3177. Article in journal (Refereed), Published.
Abstract [en]

A system is proposed in which rhythmic representations are used to model the perception of tempo in music. The system can be understood as a five-layered model, where representations are transformed into higher-level abstractions in each layer. First, source separation is applied (Audio Level), onsets are detected (Onset Level), and interonset relationships are analyzed (Interonset Level). Then, several high-level representations of rhythm are computed (Rhythm Level). The periodicity of the music is modeled by the cepstroid vector – the periodicity of an interonset interval (IOI) histogram. The pulse strength for plausible beat length candidates is defined by computing the magnitudes in different IOI histograms. The speed of the music is modeled as a continuous function on the basis of the idea that such a function corresponds to the underlying perceptual phenomena, and it seems to effectively reduce octave errors. By combining the rhythmic representations in a logistic regression framework, the tempo of the music is finally computed (Tempo Level). The results are the highest reported in a formal benchmarking test (2006-2013), with a P-Score of 0.857. Furthermore, the highest results so far are reported for two widely adopted test sets, with an Acc1 of 77.3% and 93.0% for the Songs and Ballroom datasets.
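To make the cepstroid idea concrete, the sketch below (not the published implementation) builds an inter-onset interval (IOI) histogram from onset times and picks its most salient periodicity from the histogram's autocorrelation. The function names, bin width, and periodicity range are assumptions made for illustration only.

```python
import numpy as np

def ioi_histogram(onset_times, max_interval=4.0, bin_width=0.01):
    """Histogram of intervals (in seconds) between all pairs of nearby onsets."""
    onset_times = np.asarray(onset_times)
    iois = []
    for i, t in enumerate(onset_times):
        later = onset_times[i + 1:]
        iois.extend(later[later - t <= max_interval] - t)
    bins = np.arange(0.0, max_interval + bin_width, bin_width)
    hist, edges = np.histogram(iois, bins=bins)
    return hist.astype(float), edges[:-1]

def salient_periodicity(hist, bin_width=0.01, min_period=0.2, max_period=2.0):
    """Lag (seconds) where the histogram's autocorrelation peaks, i.e. the
    most salient periodicity of the IOI histogram (a cepstroid-like value)."""
    ac = np.correlate(hist, hist, mode="full")[len(hist) - 1:]
    lags = np.arange(len(ac)) * bin_width
    valid = (lags >= min_period) & (lags <= max_period)
    return lags[valid][np.argmax(ac[valid])]

# Usage with synthetic onsets at 120 BPM (0.5 s between beats):
onsets = np.arange(0, 20, 0.5) + 0.01 * np.random.randn(40)
hist, _ = ioi_histogram(onsets)
print("Estimated salient periodicity (s):", salient_periodicity(hist))
```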

National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-171154 (URN), 10.1121/1.4919306 (DOI), 000356622400033, 26093407 (PubMedID), 2-s2.0-84934898408 (Scopus ID)
Note

QC 20150720

Available from: 2015-07-20. Created: 2015-07-20. Last updated: 2018-09-13. Bibliographically approved.
2. Beat Tracking with a Cepstroid Invariant Neural Network
2016 (English). In: 17th International Society for Music Information Retrieval Conference (ISMIR 2016), International Society for Music Information Retrieval, 2016, p. 351-357. Conference paper, Published paper (Refereed).
Abstract [en]

We present a novel rhythm tracking architecture that learns how to track tempo and beats through layered learning. A basic assumption of the system is that humans understand rhythm by letting salient periodicities in the music act as a framework, upon which the rhythmical structure is interpreted. Therefore, the system estimates the cepstroid (the most salient periodicity of the music) and uses a neural network that is invariant with regard to the cepstroid length. The input of the network consists mainly of features that capture onset characteristics over time, such as spectral differences. The invariant properties of the network are achieved by subsampling the input vectors with a hop size derived from a musically relevant subdivision of the computed cepstroid of each song. The output is filtered to detect relevant periodicities and then used in conjunction with two additional networks, which estimate the speed and tempo of the music, to predict the final beat positions. We show that the architecture has a high performance on music with public annotations.
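A minimal sketch of the cepstroid-relative subsampling described above, under assumed values for the frame rate, subdivision, and window length; the actual network, features, and hop derivation of the paper are not reproduced here.

```python
import numpy as np

def cepstroid_relative_frames(features, frame_rate, cepstroid, subdivision=4, n_frames=16):
    """Subsample a (time x dims) feature matrix with a hop of cepstroid / subdivision,
    so the sampling grid follows the music's own salient periodicity rather than
    absolute time (the basis of the invariance)."""
    hop = (cepstroid / subdivision) * frame_rate            # hop expressed in frames
    idx = np.round(np.arange(n_frames) * hop).astype(int)   # rhythm-relative grid
    idx = idx[idx < len(features)]
    return features[idx]

# Usage: 100 frames/s of 3-dimensional onset features, estimated cepstroid of 0.5 s.
features = np.random.rand(1000, 3)
window = cepstroid_relative_frames(features, frame_rate=100, cepstroid=0.5)
print(window.shape)  # up to (16, 3), frames spaced 0.125 s apart
```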

Place, publisher, year, edition, pages
International Society for Music Information Retrieval, 2016
Keywords
Beat Tracking
National Category
Computer Sciences; Other Computer and Information Science
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-195348 (URN)
Conference
17th International Society for Music Information Retrieval Conference (ISMIR 2016); New York City, USA, 7-11 August, 2016.
Funder
Swedish Research Council, 2012-4685
Note

QC 20161107

Available from: 2016-11-02. Created: 2016-11-02. Last updated: 2018-04-26. Bibliographically approved.
3. Polyphonic Pitch Tracking with Deep Layered Learning
(English). Manuscript (preprint) (Other academic)
Abstract [en]

This paper presents a polyphonic pitch tracking system able to extract both framewise and note-based estimates from audio. The system uses six artificial neural networks in a deep layered learning setup. First, cascading networks are applied to a spectrogram for framewise fundamental frequency (f0) estimation. A sparse receptive field is learned by the first network and then used for weight-sharing throughout the system. The f0 activations are connected across time to extract pitch ridges. These ridges define a framework within which subsequent networks perform tone-shift-invariant onset and offset detection. The networks convolve the pitch ridges across time, using as input, e.g., variations of latent representations from the f0 estimation networks, defined as the “neural flux.” Finally, incorrect tentative notes are removed one by one in an iterative procedure that allows a network to classify notes within an accurate context. The system was evaluated on four public test sets: MAPS, Bach10, TRIOS, and the MIREX Woodwind quintet, and achieved state-of-the-art results for all four datasets. It performs well across all subtasks: f0, pitched onset, and pitched offset tracking.
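The ridge-tracking and neural-flux ideas can be illustrated with a deliberately simplified sketch; the greedy linking rule, the threshold, and the toy data below are assumptions for illustration and not the system's actual networks.

```python
import numpy as np

def greedy_pitch_ridges(activations, threshold=0.5, max_jump=1):
    """Link framewise f0 activations (time x pitch_bins) into ridges: lists of
    (frame, pitch_bin) pairs, extended while the activation stays above the
    threshold and moves at most max_jump bins between consecutive frames."""
    n_frames, _ = activations.shape
    ridges, active = [], {}                 # active: pitch_bin -> ridge index
    for t in range(n_frames):
        new_active = {}
        for b in np.flatnonzero(activations[t] >= threshold):
            prev = next((p for p in range(b - max_jump, b + max_jump + 1)
                         if p in active), None)
            if prev is None:
                ridges.append([(t, b)])
                new_active[b] = len(ridges) - 1
            else:
                ridges[active[prev]].append((t, b))
                new_active[b] = active[prev]
        active = new_active
    return ridges

def neural_flux(latent, ridge):
    """Frame-to-frame change of latent activations along a ridge: an onset/offset cue."""
    frames = [t for t, _ in ridge]
    return np.linalg.norm(np.diff(latent[frames], axis=0), axis=1)

# Usage with toy data: one sustained tone in bin 40 between frames 10 and 60.
act = np.zeros((100, 88)); act[10:60, 40] = 0.9
latent = np.random.rand(100, 32)
ridges = greedy_pitch_ridges(act)
print(len(ridges), "ridge(s); flux shape:", neural_flux(latent, ridges[0]).shape)
```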

National Category
Other Computer and Information Science
Research subject
Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-226891 (URN)
Note

QC 20180427

arXiv preprint arXiv:1804.02918

Available from: 2018-04-26. Created: 2018-04-26. Last updated: 2018-04-27. Bibliographically approved.
4. Using listener-based perceptual features as intermediate representations in music information retrieval
2014 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 136, no. 4, p. 1951-1963. Article in journal (Refereed), Published.
Abstract [en]

The notion of perceptual features is introduced for describing general music properties based on human perception. This is an attempt at rethinking the concept of features, aiming to approach the underlying human perception mechanisms. Instead of using concepts from music theory such as tones, pitches, and chords, a set of nine features describing overall properties of the music was selected. They were chosen from qualitative measures used in psychology studies and motivated from an ecological approach. The perceptual features were rated in two listening experiments using two different data sets. They were modeled both from symbolic and audio data using different sets of computational features. Ratings of emotional expression were predicted using the perceptual features. The results indicate that (1) at least some of the perceptual features are reliable estimates; (2) emotion ratings could be predicted by a small combination of perceptual features with an explained variance from 75% to 93% for the emotional dimensions activity and valence; (3) the perceptual features could only to a limited extent be modeled using existing audio features. Results clearly indicated that a small number of dedicated features were superior to a "brute force" model using a large number of general audio features.
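As a hedged illustration of the modelling setup (perceptual features as predictors of emotion ratings), the sketch below fits an ordinary linear regression on placeholder data; the feature count, the rating scale, and the synthetic ratings are assumptions, not the study's data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Placeholder ratings: 100 musical excerpts x 9 perceptual features
# (mean listener ratings, e.g. speed, dynamics, articulation, ...).
perceptual = rng.uniform(1, 9, size=(100, 9))
valence = perceptual @ rng.normal(size=9) + rng.normal(scale=0.5, size=100)

# A small, dedicated feature set in a plain regression model.
model = LinearRegression().fit(perceptual, valence)
print("Explained variance (R^2) on the training data:", model.score(perceptual, valence))
```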

Keywords
Communication, Performance, Emotion, Speech, Loudness, Timbre, Tempo, Model, Pitch, Mood
National Category
Computer Sciences; Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:kth:diva-158173 (URN), 10.1121/1.4892767 (DOI), 000345977400059, 25324094 (PubMedID), 2-s2.0-84907863477 (Scopus ID)
Funder
Swedish Research Council, 2009-4285, 2012-4685
Note

QC 20150108

Available from: 2014-12-30. Created: 2014-12-30. Last updated: 2018-04-26. Bibliographically approved.
5. Modelling the Speed of Music Using Features from Harmonic/Percussive Separated Audio
2013 (English). In: Proceedings of the 14th International Society for Music Information Retrieval Conference, 2013, p. 481-486. Conference paper, Published paper (Refereed).
Abstract [en]

One of the major parameters in music is the overall speed of a musical performance. In this study, a computational model of speed in music audio has been developed using a custom set of rhythmic features. Speed is often associated with tempo, but as shown in this study, factors such as note density (onsets per second) and spectral flux are important as well. The original audio was first separated into a harmonic part and a percussive part, and the features were extracted separately from the different layers. In previous studies, listeners had rated the speed of 136 songs, and the ratings were used in a regression to evaluate the validity of the model as well as to find appropriate features. The final models, consisting of 5 or 8 features, were able to explain about 90% of the variation in the training set, with little or no degradation for the test set.
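One plausible way to reproduce the general feature-extraction idea (harmonic/percussive separation followed by onset-density and spectral-flux features) is sketched below using librosa; the library choice, sample rate, and feature definitions are assumptions, not the paper's exact implementation.

```python
import numpy as np
import librosa

def speed_features(path):
    """Extract speed-related features separately from the harmonic and percussive
    layers of a track: onset density (onsets per second) and mean onset strength
    (a spectral-flux-like measure), roughly in the spirit of the paper."""
    y, sr = librosa.load(path, sr=22050, mono=True)
    y_harm, y_perc = librosa.effects.hpss(y)
    feats = {}
    for name, layer in [("harmonic", y_harm), ("percussive", y_perc)]:
        env = librosa.onset.onset_strength(y=layer, sr=sr)
        onsets = librosa.onset.onset_detect(onset_envelope=env, sr=sr)
        duration = len(layer) / sr
        feats[f"{name}_onset_density"] = len(onsets) / duration
        feats[f"{name}_mean_flux"] = float(np.mean(env))
    return feats

# A regression on listener speed ratings (as in the study) could then take these
# features as input, e.g. with sklearn.linear_model.LinearRegression.
```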

National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-137411 (URN), 978-0-615-90065-0 (ISBN)
Conference
14th International Society for Music Information Retrieval Conference (ISMIR 2013); Curitiba, Brazil, 4-8 November, 2013
Note

QC 20140213

Available from: 2013-12-13. Created: 2013-12-13. Last updated: 2018-09-13. Bibliographically approved.
6. Predicting the perception of performed dynamics in music audio with ensemble learning
2017 (English). In: Journal of the Acoustical Society of America, ISSN 0001-4966, E-ISSN 1520-8524, Vol. 141, no. 3, p. 2224-2242. Article in journal (Refereed), Published.
Abstract [en]

By varying the dynamics in a musical performance, the musician can convey structure and different expressions. Spectral properties of most musical instruments change in a complex way with the performed dynamics, but dedicated audio features for modeling the parameter are lacking. In this study, feature extraction methods were developed to capture relevant attributes related to spectral characteristics and spectral fluctuations, the latter through a sectional spectral flux. Previously, ground truth ratings of performed dynamics had been collected by asking listeners to rate how soft/loud the musicians played in a set of audio files. The ratings, averaged over subjects, were used to train three different machine learning models, using the audio features developed for the study as input. The highest result was produced from an ensemble of multilayer perceptrons with an R2 of 0.84. This result seems to be close to the upper bound, given the estimated uncertainty of the ground truth data. The result is well above that of individual human listeners in the previous listening experiment, and on par with the performance achieved from the average rating of six listeners. Features were analyzed with a factorial design, which highlighted the importance of source separation in the feature extraction.
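A minimal sketch of the ensemble approach described above, assuming placeholder features and ratings and arbitrarily chosen network sizes (the paper's actual features and hyperparameters are not reproduced).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Placeholder data: 200 tracks x 30 spectral/flux features, with averaged
# listener ratings of performed dynamics as the target.
X = rng.normal(size=(200, 30))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.3, size=200)

# A small ensemble of multilayer perceptrons whose predictions are averaged.
ensemble = [MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000,
                         random_state=seed).fit(X, y) for seed in range(5)]
prediction = np.mean([m.predict(X) for m in ensemble], axis=0)

ss_res = np.sum((y - prediction) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
print("Training R^2 of the averaged ensemble:", 1 - ss_res / ss_tot)
```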

Place, publisher, year, edition, pages
Acoustical Society of America (ASA), 2017
Keywords
Performed dynamics, dynamics, music, timbre, ensemble learning, perceptual features
National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-204657 (URN), 10.1121/1.4978245 (DOI), 000398962500101, 2-s2.0-85016561050 (Scopus ID)
Funder
Swedish Research Council
Note

QC 20170406

Available from: 2017-03-30. Created: 2017-03-30. Last updated: 2018-09-13. Bibliographically approved.
7. Long-term Average Spectrum in Popular Music and its Relation to the Level of the Percussion
2017 (English). In: AES 142nd Convention, Berlin, Germany, 2017. Conference paper, Published paper (Refereed).
Abstract [en]

The spectral distribution of music audio has an important influence on listener perception, but large-scale characterizations are lacking. Therefore, the long-term average spectrum (LTAS) was analyzed for a large dataset of popular music. The mean LTAS was computed, visualized, and then approximated with two quadratic fittings. The fittings were subsequently used to derive the spectrum slope. By applying harmonic/percussive source separation, the relationship between LTAS and percussive prominence was investigated. A clear relationship was found; tracks with more percussion have a relatively higher LTAS in the bass and high frequencies. We show how this relationship can be used to improve targets in automatic equalization. Furthermore, we assert that variations in LTAS between genres are mainly a side-effect of percussive prominence.
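The LTAS computation and quadratic fitting can be sketched as follows; the use of Welch's method, the frequency range, and the synthetic test signal are assumptions for illustration, not the paper's procedure.

```python
import numpy as np
from scipy.signal import welch

def ltas_db(y, sr, n_fft=4096):
    """Long-term average spectrum of a mono signal, in dB."""
    freqs, psd = welch(y, fs=sr, nperseg=n_fft)
    return freqs, 10 * np.log10(psd + 1e-12)

def quadratic_fit(freqs, ltas, f_lo=100, f_hi=10000):
    """Fit the LTAS (dB) with a quadratic over log-frequency; the linear
    coefficient gives a rough spectrum slope in dB per decade."""
    mask = (freqs >= f_lo) & (freqs <= f_hi)
    coeffs = np.polyfit(np.log10(freqs[mask]), ltas[mask], deg=2)
    return coeffs  # [quadratic term, slope per decade, offset]

# Usage with a synthetic signal (brown-ish noise as a stand-in for music).
sr = 44100
y = np.cumsum(np.random.randn(10 * sr)) * 1e-3
freqs, ltas = ltas_db(y, sr)
print("Quadratic fit coefficients:", quadratic_fit(freqs, ltas))
```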

National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-208885 (URN)
Conference
AES 142nd Convention, Berlin, Germany
Note

QC 20170620

Available from: 2017-06-12. Created: 2017-06-12. Last updated: 2018-09-13. Bibliographically approved.
8. Algorithmic Composition of Popular Music
2012 (English). In: Proceedings of the 12th International Conference on Music Perception and Cognition and the 8th Triennial Conference of the European Society for the Cognitive Sciences of Music / [ed] Emilios Cambouropoulos, Costas Tsourgas, Panayotis Mavromatis, Costas Pastiadis, 2012, p. 276-285. Conference paper, Published paper (Refereed).
Abstract [en]

Human composers have used formal rules for centuries to compose music, and an algorithmic composer – composing without the aid of human intervention – can be seen as an extension of this technique. An algorithmic composer of popular music (a computer program) has been created with the aim of getting a better understanding of how the composition process can be formalized and, at the same time, a better understanding of popular music in general. With the aid of statistical findings, a theoretical framework for the relevant methods is presented. The concept of Global Joint Accent Structure is introduced as a way of understanding how melody and rhythm interact to help the listener form expectations about future events. The methods of the program are presented with references to supporting statistical findings. The algorithmic composer creates a rhythmic foundation (drums), a chord progression, a phrase structure, and finally the melody. The main focus has been the composition of the melody. The melodic generation is based on ten different musical aspects, which are described. The resulting output was evaluated in a formal listening test in which 14 computer compositions were compared with 21 human compositions. Results indicate a slightly lower score for the computer compositions, but the differences were not statistically significant.
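A toy, rule-based sketch in the spirit of the description above: a fixed chord progression and a melody that prefers chord tones on strong beats and stepwise motion on weak beats. The scale, chords, and selection rules are invented for illustration and are not the program's actual methods.

```python
import random

random.seed(0)

SCALE = [60, 62, 64, 65, 67, 69, 71, 72]                   # C major, MIDI numbers
CHORDS = {"C": {60, 64, 67}, "F": {60, 65, 69}, "G": {62, 67, 71}, "Am": {60, 64, 69}}

def compose_phrase(progression=("C", "Am", "F", "G"), beats_per_chord=4):
    """Generate a melody over a chord progression: chord tones on strong beats,
    small scale steps on weak beats."""
    melody, current = [], 67
    for chord in progression:
        tones = CHORDS[chord]
        for beat in range(beats_per_chord):
            if beat % 2 == 0:   # strong beat: nearest chord tone
                current = min((p for p in SCALE if p in tones),
                              key=lambda p: abs(p - current))
            else:               # weak beat: a small step up or down within the scale
                current = random.choice([p for p in SCALE if 0 < abs(p - current) <= 2])
            melody.append(current)
    return melody

print(compose_phrase())   # a list of MIDI pitches, four beats per chord
```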

National Category
Computer and Information Sciences
Research subject
Speech and Music Communication
Identifiers
urn:nbn:se:kth:diva-109400 (URN)
Conference
the 12th International Conference on Music Perception and Cognition and the 8th Triennial Conference of the European Society for the Cognitive Sciences of Music
Note

QC 20130523

Available from: 2013-01-02. Created: 2013-01-02. Last updated: 2018-09-13. Bibliographically approved.
9. Deep Layered Learning in MIR
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Deep learning has boosted the performance of many music information retrieval (MIR) systems in recent years. Yet, the complex hierarchical arrangement of music makes end-to-end learning hard for some MIR tasks – a very deep and structurally flexible processing chain is necessary to extract high-level features from a spectrogram representation. Mid-level representations such as tones, pitched onsets, chords, and beats are fundamental building blocks of music. This paper discusses how these can be used as intermediate representations in MIR to facilitate deep processing that generalizes well: each music concept is predicted individually in learning modules that are connected through latent representations in a directed acyclic graph. It is suggested that this strategy for inference, defined as deep layered learning (DLL), can help generalization by (1) enforcing the validity of intermediate representations during processing, and by (2) letting the inferred representations establish disentangled structures that support high-level invariant processing. A background to DLL and modular music processing is provided, and relevant concepts such as pruning, skip connections, and layered performance supervision are reviewed.
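A schematic sketch of the layered-learning idea: one module is trained on an intermediate musical target, and a second module consumes the first module's output for a higher-level target. The sklearn models, the placeholder data, and the pooling step are assumptions standing in for the networks discussed in the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier, MLPRegressor

rng = np.random.default_rng(2)

# Placeholder "spectrogram" frames and two levels of synthetic targets:
# an intermediate concept (beat / no beat per frame) and a high-level
# target per excerpt that depends on it (a tempo-like value).
X = rng.normal(size=(500, 40))
beat_frames = (X[:, 0] + X[:, 1] > 0).astype(int)       # intermediate target
tempo = 60 + 60 * beat_frames.reshape(50, 10).mean(1)   # high-level target per excerpt

# Module 1: learn the intermediate representation with its own supervision.
beat_model = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0).fit(X, beat_frames)

# Module 2: consume module 1's output (pooled per excerpt) instead of raw input.
beat_prob = beat_model.predict_proba(X)[:, 1].reshape(50, 10)
tempo_model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=4000,
                           random_state=0).fit(beat_prob, tempo)
print("Tempo R^2 via the layered chain:", tempo_model.score(beat_prob, tempo))
```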

National Category
Other Computer and Information Science; Computer Engineering; Computer Sciences
Research subject
Computer Science; Information and Communication Technology
Identifiers
urn:nbn:se:kth:diva-226893 (URN)
Note

QC 20180427

arXiv preprint arXiv:1804.07297

Available from: 2018-04-26. Created: 2018-04-26. Last updated: 2018-04-27. Bibliographically approved.
10. Tempo-Invariant Processing of Rhythm with Convolutional Neural Networks
(English). Manuscript (preprint) (Other academic)
Abstract [en]

Rhythm patterns can be performed with a wide variation of tempi. This presents a challenge for many music information retrieval (MIR) systems; ideally, perceptually similar rhythms should be represented and processed similarly, regardless of the specific tempo at which they were performed. Several recent systems for tempo estimation, beat tracking, and downbeat tracking have therefore sought to process rhythm in a tempo-invariant way, often by sampling input vectors according to a precomputed pulse level. This paper describes how a log-frequency representation of rhythm-related activations can instead promote tempo invariance when processed with convolutional neural networks. The strategy incorporates invariance at a fundamental level and can be useful for most tasks related to rhythm processing. Different methods are described, relying on magnitude, on phase relationships between different rhythm channels, and on raw phase information. Several variations are explored to provide direction for future implementations.
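The core representational trick can be sketched in a few lines: resampling the periodicity (lag) axis of an autocorrelated onset envelope onto a logarithmic tempo grid, so that a tempo change becomes a translation that shared convolutional weights can absorb. The grid bounds, bin count, and interpolation below are assumptions for illustration.

```python
import numpy as np

def log_periodicity_map(onset_envelope, frame_rate, n_bins=96,
                        min_bpm=40.0, max_bpm=240.0):
    """Autocorrelate an onset-strength envelope and resample the lag axis onto a
    log-spaced tempo grid, so a tempo change becomes a shift along the axis."""
    ac = np.correlate(onset_envelope, onset_envelope, mode="full")
    ac = ac[len(onset_envelope) - 1:]                   # non-negative lags only
    lags = np.arange(1, len(ac)) / frame_rate           # lag in seconds, skip lag 0
    bpm_of_lag = 60.0 / lags
    log_grid = np.geomspace(min_bpm, max_bpm, n_bins)   # log-spaced tempo axis
    order = np.argsort(bpm_of_lag)                      # interpolation needs ascending x
    return np.interp(log_grid, bpm_of_lag[order], ac[1:][order])

# Usage: a pulse train at 120 BPM sampled at 100 frames per second.
frame_rate, env = 100, np.zeros(2000)
env[::50] = 1.0                                         # one onset every 0.5 s
print(log_periodicity_map(env, frame_rate).shape)       # (96,)
```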

National Category
Other Computer and Information Science
Identifiers
urn:nbn:se:kth:diva-226892 (URN)
Note

QC 20180427

arXiv preprint arXiv:1804.08167

Available from: 2018-04-26. Created: 2018-04-26. Last updated: 2018-04-27. Bibliographically approved.

Open Access in DiVA

Introduction and Summary of Dissertation: Modeling Music – Anders Elowsson (FULLTEXT01.pdf, 1947 kB)
