On large deviations in probabilistic deep learning and generative modeling
KTH, School of Engineering Sciences (SCI), Mathematics (Dept.), Probability, Mathematical Physics and Statistics. ORCID iD: 0000-0001-5740-5103
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The topic of this thesis is the use of probabilistic methods in machine learning. These play a foundational role in motivating and defining machine learning algorithms, as well as in explaining why, and how well, the algorithms work. During the years in which this work was conducted (2020-2025), machine learning has gone from delivering a handful of impressive demonstrable results to becoming a staple of modern developed society, with products such as ChatGPT having capacity and applicability far beyond what anyone in the field expected ten years ago. Explaining why the modern methods work so well, despite their conceptual simplicity and elegance, requires both empirical and theoretical studies. This thesis contains both, though the emphasis is on theory. The first part of the thesis, Papers A-C, concerns the implementation and analysis of novel methodologies in deep learning, whereas Papers D-F concern purely theoretical large deviations results for machine learning-adjacent models. The main thread is the application of mathematical tools from probability theory and statistics, such as the theory of large deviations and empirical process theory, to the understanding and improvement of machine learning methodology.
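Since large deviation principles are the central tool of the thesis, it may help to recall the standard textbook definition (a general statement, not specific to the papers): a sequence of probability measures $(\mu_n)$ on a topological space satisfies a large deviation principle with rate function $I$ if

```latex
% Informally: \mu_n(A) \approx \exp\!\left(-n \inf_{x \in A} I(x)\right),
% made precise by a lower bound on open sets G and an upper bound on closed sets F:
\liminf_{n \to \infty} \frac{1}{n} \log \mu_n(G) \;\ge\; -\inf_{x \in G} I(x),
\qquad
\limsup_{n \to \infty} \frac{1}{n} \log \mu_n(F) \;\le\; -\inf_{x \in F} I(x).
```

The rate function thus quantifies, on an exponential scale, how unlikely deviations from the typical (limiting) behavior are.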

In Paper A, which presents the most applied direction pursued in the thesis, a deep probabilistic network model is applied to a task from the field of clinical radiation therapy, namely dose prediction, where a target radiation dose value is to be assigned to each pixel/voxel of human tissue based on a CT image. The developed probabilistic model is based on mixture density networks. It is empirically demonstrated that a convolutional U-net can learn a satisfactory mixture distribution over the dose in each pixel. To the best of our knowledge, this is the first implementation of mixture density networks operating on images with a convolutional architecture.
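To make the idea concrete, a mixture density network outputs mixture weights, means, and standard deviations per pixel and is trained by minimizing the mixture negative log-likelihood of the observed dose. The following is an illustrative numpy sketch of that loss under a per-pixel Gaussian mixture; it is a generic textbook formulation, not the paper's implementation, and all array shapes here are hypothetical.

```python
import numpy as np

def mixture_nll(pi, mu, sigma, y):
    """Negative log-likelihood of targets y under per-pixel Gaussian mixtures.

    pi, mu, sigma: arrays of shape (n_pixels, n_components) holding the
    mixture weights, component means, and component std devs that a
    mixture density network would predict for each pixel.
    y: observed dose values, shape (n_pixels,).
    """
    y = y[:, None]  # broadcast targets against the component axis
    # log N(y | mu, sigma^2) for each component
    log_norm = (-0.5 * np.log(2 * np.pi)
                - np.log(sigma)
                - 0.5 * ((y - mu) / sigma) ** 2)
    # mixture likelihood per pixel: sum of weighted component densities
    log_mix = np.log(np.sum(pi * np.exp(log_norm), axis=1))
    return -np.mean(log_mix)

# Toy example: two components per pixel, targets close to one mode
pi = np.array([[0.7, 0.3], [0.5, 0.5]])
mu = np.array([[1.0, 3.0], [0.0, 2.0]])
sigma = np.ones((2, 2))
y = np.array([1.1, 0.2])
print(mixture_nll(pi, mu, sigma, y))  # a finite positive scalar
```

In the paper's setting, the parameters `pi`, `mu`, `sigma` would be produced per voxel by the convolutional U-net rather than given as fixed arrays.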

In Paper B, we construct and implement a new method, called REMEDI, for entropy estimation of continuous distributions using deep neural networks. Further, using empirical process theory, it is shown that the estimator has a consistency property, ensuring that it has the theoretical capacity to estimate the entropy to arbitrary precision. The method is based on the celebrated Donsker-Varadhan lemma, a well-known result from the theory of large deviations. The applicability of the method is demonstrated on distributions in moderate dimension, as well as on the task of training models in the information bottleneck framework, with satisfactory performance.
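The Donsker-Varadhan lemma states that KL(P || Q) = sup over test functions T of E_P[T] - log E_Q[exp(T)], so every T yields a lower bound and a neural network can be trained to tighten it. The sketch below evaluates this bound from samples for a one-dimensional Gaussian example where the optimal T is known in closed form; it is purely illustrative and not REMEDI itself, in which T is a learned network correcting an adaptive base model.

```python
import numpy as np

rng = np.random.default_rng(0)

def dv_lower_bound(T, xs_p, xs_q):
    """Donsker-Varadhan lower bound on KL(P || Q):
    E_P[T] - log E_Q[exp(T)], estimated from samples of P and Q.
    Any test function T gives a valid lower bound; the supremum over
    all T attains the KL divergence exactly.
    """
    return np.mean(T(xs_p)) - np.log(np.mean(np.exp(T(xs_q))))

# P = N(1, 1), Q = N(0, 1); the true KL divergence is 0.5
xs_p = rng.normal(1.0, 1.0, 100_000)
xs_q = rng.normal(0.0, 1.0, 100_000)

# The optimal test function is the log-likelihood ratio T*(x) = x - 0.5
bound = dv_lower_bound(lambda x: x - 0.5, xs_p, xs_q)
print(bound)  # close to 0.5, the true KL divergence
```

Replacing the closed-form `T*` with a trainable network, and maximizing the bound by gradient ascent, is the basic mechanism that neural KL and entropy estimators in this family build on.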

In Paper C, we explore the integration of non-parametric model components into the flow-matching framework. This is done by learning heavily compressed latent representations of the images in the training dataset, which are then used as conditioning variables for the vector field network. Effectively, these can be seen as synthetic, continuous labels. The gain is a more efficient learning process compared to baseline models, as well as more interpretable sampling. It is demonstrated that with sufficient compression, overfitting can be avoided and diversity among samples attained, despite conditioning on training samples.
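As background, flow matching trains a vector field network by regressing it onto the velocity of interpolation paths between source noise and data; a conditioning variable such as a latent code is simply an extra network input. The following numpy sketch constructs the training targets for the common straight-line (linear interpolation) choice of path; it is a generic flow-matching construction, not the paper's code, and the batch shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def flow_matching_targets(x0, x1, t):
    """Training targets for flow matching with linear interpolation paths.

    x0: batch of source samples (e.g. standard Gaussian), shape (batch, dim)
    x1: batch of data samples, shape (batch, dim)
    t:  interpolation times in [0, 1], shape (batch,)
    Returns the point x_t on each path and the velocity that the
    vector field network v(x_t, t, ...) is regressed onto.
    """
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    velocity = x1 - x0  # constant along a straight path
    return xt, velocity

x0 = rng.normal(size=(4, 2))             # source Gaussian batch
x1 = rng.normal(loc=5.0, size=(4, 2))    # stand-in for a data batch
t = rng.uniform(size=4)
xt, v = flow_matching_targets(x0, x1, t)
print(xt.shape, v.shape)  # (4, 2) (4, 2)
```

In the latent-conditioned variant described in the abstract, a compressed latent code of each training image would be passed to the network as an additional input alongside `(xt, t)`, acting as a synthetic continuous label.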

Arguably, the most impactful area in machine learning is generative modeling. One part of the thesis deals with applying the theory of large deviations to two of its major methods: generative adversarial networks (GANs) and diffusion models, in particular Schrödinger bridges. The goal here is to prove large deviation principles for certain sequences of probability measures associated with the models. In both cases, this allows a strong characterization of the convergence of these models, as certain model parameters are varied, toward an idealized description of their behavior, often carrying a well-understood mathematical structure. For Schrödinger bridges, this idealized limit model consists of a dynamical optimal transport plan. This tells us that when varying the parameter in question, which is the reference noise level or, as is often equivalent, the level of entropic regularization, the plans converge rapidly toward optimal transport behavior, justifying the interpretation of weakly regularized Schrödinger bridges as approximate optimal transport plans. Since Schrödinger bridges (or entropically regularized optimal transport plans) have nicer computational properties than optimal transport, they are often used in its place, and it is therefore important to understand how close this connection is. The large deviation principles derived here, applicable to several popular deep generative models, thus contribute to this understanding. Stating and proving such large deviations results are the contents of Papers E and F.
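The "nicer computational properties" mentioned above stem largely from the fact that discrete entropic optimal transport can be solved by simple Sinkhorn matrix scaling. The sketch below, a standard textbook formulation rather than anything from the papers, shows a toy two-point problem in which the entropic plan visibly approaches the unregularized optimal transport plan as the regularization eps shrinks; this is the small-noise limit that the large deviation principles quantify.

```python
import numpy as np

def sinkhorn(C, a, b, eps, n_iter=500):
    """Entropic optimal transport via Sinkhorn iterations.

    Solves min_P <P, C> + eps * KL(P || a b^T) over couplings P of the
    discrete marginals a and b, with cost matrix C. As eps -> 0 the
    plan approaches an unregularized optimal transport plan.
    """
    K = np.exp(-C / eps)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):       # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Two points to two points; the cost favors the identity matching
C = np.array([[0.0, 1.0], [1.0, 0.0]])
a = b = np.array([0.5, 0.5])

for eps in (1.0, 0.1, 0.01):
    P = sinkhorn(C, a, b, eps)
    print(eps, P[0, 0])  # mass on the cheap matching grows toward 0.5
```

At eps = 1 the plan still spreads noticeable mass on the expensive matching, while at eps = 0.01 it is essentially the deterministic optimal transport plan; the rate of this concentration is exactly what a large deviation principle describes.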

For GANs, we consider, in Paper D, recently developed particle systems making up cohorts of networks for the generative task. Here, we show that when the parameter is taken to be the number of particles in the system, the training dynamics of these networks converge toward a McKean-Vlasov process, and a large deviation principle is established. This enables the study of the convergence of such particle systems, which have recently been proposed as a new generative model in the GAN literature, toward their mean-field behavior using the theory of large deviations.
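An interacting particle system of this kind is a set of diffusions whose drift depends on the empirical measure of all particles; as the number of particles grows, that empirical measure converges to the law of a McKean-Vlasov process. The Euler-Maruyama sketch below simulates a deliberately simple mean-reverting interaction to illustrate the structure; it is NOT the GAN training dynamics of Paper D, merely a toy instance of the same class of models.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_particles(n, steps=200, dt=0.01, sigma=0.5):
    """Euler-Maruyama for the toy interacting particle system
        dX_i = -(X_i - mean(X)) dt + sigma dW_i,
    where each particle is attracted to the empirical mean.
    As n grows, the empirical measure concentrates on the law of the
    corresponding McKean-Vlasov limit process; a large deviation
    principle quantifies the probability of deviations from it.
    """
    x = rng.normal(size=n)                       # initial particle positions
    for _ in range(steps):
        drift = -(x - x.mean())                  # interaction through the empirical mean
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
    return x

x = simulate_particles(1000)
print(x.std())  # spread contracts toward the mean-field stationary value
```

In Paper D's setting, the particles are network parameters and the drift comes from the (entropically regularized) game dynamics, but the limit structure, an empirical measure converging to a McKean-Vlasov law, is the same.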

Abstract [sv]

Denna avhandling avser användandet av probabilistiska metoder inom maskininlärning. Dessa spelar en grundläggande roll i att motivera och definiera maskininlärningsalgoritmer, såväl som i att förklara varför dessa algoritmer fungerar, och hur väl de fungerar. Under åren som detta arbete utfördes (2020-2025) har maskininlärning gått från att leverera en handfull imponerande och demonstrerbara resultat till att bli en stapelvara i det moderna utvecklade samhället, med produkter som ChatGPT, med kapacitet och tillämpbarhet bortom vad någon i fältet hade förväntat sig för tio år sedan. Att förklara varför de moderna metoderna fungerar så väl, trots deras konceptuella enkelhet och elegans, kräver både empiriska och teoretiska studier. Denna avhandling har båda delar, men betoningen ligger på teori. Den första delen av avhandlingen, Artiklar A-C, berör implementation och analys av nya metodologier inom djupinlärning, medan Artiklar D-F berör rent teoretiska stora avvikelse-resultat för modeller som ligger nära maskininlärning. Den huvudsakliga tråden är tillämpningen av matematiska verktyg från sannolikhetsteori och statistik, såsom teorin för stora avvikelser och empirisk processteori, för att förstå och förbättra metodologi inom maskininlärning.

I Artikel A, vilken utgör den mest tillämpade riktningen i denna avhandling, appliceras en probabilistisk djup nätverksmodell på en uppgift från fältet klinisk strålningsterapi, nämligen dosprediktion, där ett målvärde av strålningsdos ska tilldelas varje pixel/voxel av en mänsklig vävnad baserat på en CT-bild. Den utvecklade probabilistiska modellen är baserad på mixturdensitetsnätverk. Empiriskt demonstreras att ett faltningsbaserat U-net kan lära sig en tillfredsställande mixturdistribution över dosen i varje pixel. Så vitt vi vet är detta den första implementationen av mixturdensitetsnätverk som verkar på bilder med en faltningsarkitektur.

I Artikel B konstrueras och implementeras en ny metod kallad REMEDI för entropiestimering av kontinuerliga distributioner med hjälp av djupa neurala nätverk. Dessutom visas, med hjälp av empirisk processteori, att en sådan estimator har en konsistensegenskap, vilket försäkrar oss om att den har teoretisk kapacitet att estimera entropin till godtycklig precision. Metoden baseras på det bejublade Donsker-Varadhan-lemmat, ett välkänt resultat ifrån teorin för stora avvikelser. Tillämpbarheten av metoden demonstreras på distributioner i moderat dimension, såväl som för modellträning inom informations-flaskhalsramverket, med tillfredsställande prestanda.

I Artikel C utforskas integrationen av icke-parametriska modellkomponenter i flödesmatchningsramverket. Detta görs genom att lära sig en tungt komprimerad latent representation av bilderna i träningsdatamängden, som sedan används som betingningsvariabler för vektorfältsnätverket. Effektivt kan dessa betraktas som syntetiska, kontinuerliga klassvariabler. Vinsten är en mer effektiv inlärningsprocess jämfört med baslinjemodeller, samt mer tolkningsbar slumpgenerering. Det demonstreras att, med tillräcklig komprimering, kan överträning undvikas och mångfald bland slumpgenererade exempel uppnås, trots betingningen på träningsexempel.

Det kan argumenteras för att det mest betydelsefulla fältet inom maskininlärning är generativ modellering. En del av denna avhandling handlar om att tillämpa teorin för stora avvikelser på två av dess huvudsakliga metoder: generativa motstående nätverk (GAN) och diffusionsmodeller, särskilt Schrödingerbroar. Målet här är att bevisa stora avvikelseprinciper för vissa följder av sannolikhetsmått associerade med modellerna. I båda fallen tillåter detta en stark karaktärisering av konvergensen av dessa modeller, när särskilda modellparametrar varieras, mot en idealiserad beskrivning av deras beteende, som ofta bär en välförstådd matematisk struktur.

Inom Schrödingerbroar består denna idealiserade gränsmodell av en dynamisk optimal transportplan. Detta säger oss att när parametern i fråga varieras, vilken är referensprocessens brusnivå eller, som ofta är ekvivalent, nivån av entropisk regularisering, konvergerar planerna snabbt mot optimalt transportbeteende, vilket rättfärdigar tolkningen av svagt regulariserade Schrödingerbroar som approximativa optimala transportplaner. Eftersom Schrödingerbroar (eller entropiskt regulariserade optimala transportplaner) har trevligare beräkningsmässiga egenskaper än optimal transport, används de ofta i dess ställe, och det är därför viktigt att förstå hur nära denna koppling är. De härledda stora avvikelseprinciperna, som är tillämpbara på flera populära djupa generativa modeller, bidrar därför till denna förståelse. Att formulera och bevisa sådana stora avvikelseprinciper är innehållet i Artikel E och Artikel F.

För GAN-modeller betraktar vi, i Artikel D, nyligen utvecklade partikelsystem som utgör kohorter av nätverk för den generativa uppgiften. Här visar vi att när parametern tas till att vara antalet partiklar i systemet, konvergerar träningsdynamiken för dessa nätverk mot en McKean-Vlasov-process, och en stor avvikelseprincip etableras. Detta möjliggör studiet av konvergensen för sådana partikelsystem, som nyligen har framlagts som en ny generativ modell i GAN-litteraturen, mot deras medelfältsteoretiska beteende med hjälp av teorin för stora avvikelser.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025.
Series
TRITA-SCI-FOU ; 2025:59
Keywords [en]
Large deviations, Generative models, Schrödinger bridges, Optimal transport, Machine learning
Keywords [sv]
Stora avvikelser, Generativa modeller, Schrödingerbroar, Optimal transport, Maskininlärning
National Category
Probability Theory and Statistics
Research subject
Mathematics
Identifiers
URN: urn:nbn:se:kth:diva-373185
ISBN: 978-91-8106-438-4 (print)
OAI: oai:DiVA.org:kth-373185
DiVA, id: diva2:2015506
Public defence
2025-12-11, Kollegiesalen, Brinellvägen 8, Stockholm, 14:00 (English)
Opponent
Supervisors
Funder
Wallenberg AI, Autonomous Systems and Software Program (WASP), 67105
Note

QC 2025-11-21

Available from: 2025-11-21 Created: 2025-11-21 Last updated: 2025-12-02 Bibliographically approved
List of papers
1. Probabilistic dose prediction using mixture density networks for automated radiation therapy treatment planning
2021 (English) In: Physics in Medicine and Biology, ISSN 0031-9155, E-ISSN 1361-6560, Vol. 66, no 5, article id 055003. Article in journal (Refereed) Published
Abstract [en]

We demonstrate the application of mixture density networks (MDNs) in the context of automated radiation therapy treatment planning. It is shown that an MDN can produce good predictions of dose distributions as well as reflect uncertain decision making associated with inherently conflicting clinical tradeoffs, in contrast to deterministic methods previously investigated in the literature. A two-component Gaussian MDN is trained on a set of treatment plans for postoperative prostate patients with varying extents to which rectum dose sparing was prioritized over target coverage. Examination on a test set of patients shows that the predicted modes follow their respective ground truths well, both spatially and in terms of their dose-volume histograms. A special dose mimicking method based on the MDN output is used to produce deliverable plans and thereby showcase the usability of voxel-wise predictive densities. Thus, this type of MDN may serve to support clinicians in managing clinical tradeoffs and has the potential to improve the quality of plans produced by an automated treatment planning pipeline.

Place, publisher, year, edition, pages
Institute of Physics (IOP), 2021
Keywords
mixture density network, dose prediction, dose mimicking, knowledge-based planning, deep learning, radiation therapy treatment planning
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-291992 (URN)
10.1088/1361-6560/abdd8a (DOI)
000618026500001 ()
33470973 (PubMedID)
2-s2.0-85101304527 (Scopus ID)
Note

QC 20210329

Available from: 2021-03-29 Created: 2021-03-29 Last updated: 2025-11-21 Bibliographically approved
2. REMEDI: Corrective Transformations for Improved Neural Entropy Estimation
2024 (English) In: International Conference on Machine Learning, ICML 2024, ML Research Press, 2024, p. 38207-38236. Conference paper, Published paper (Refereed)
Abstract [en]

Information theoretic quantities play a central role in machine learning. The recent surge in the complexity of data and models has increased the demand for accurate estimation of these quantities. However, as the dimension grows the estimation presents significant challenges, with existing methods struggling already in relatively low dimensions. To address this issue, in this work, we introduce REMEDI for efficient and accurate estimation of differential entropy, a fundamental information theoretic quantity. The approach combines the minimization of the cross-entropy for simple, adaptive base models and the estimation of their deviation, in terms of the relative entropy, from the data density. Our approach demonstrates improvement across a broad spectrum of estimation tasks, encompassing entropy estimation on both synthetic and natural data. Further, we extend important theoretical consistency results to a more generalized setting required by our approach. We illustrate how the framework can be naturally extended to information theoretic supervised learning models, with a specific focus on the Information Bottleneck approach. It is demonstrated that the method delivers better accuracy compared to the existing methods in Information Bottleneck. In addition, we explore a natural connection between REMEDI and generative modeling using rejection sampling and Langevin dynamics.

Place, publisher, year, edition, pages
ML Research Press, 2024
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-353945 (URN)
2-s2.0-85203821749 (Scopus ID)
Conference
41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, Jul 21 2024 - Jul 27 2024
Note

QC 20240926

Available from: 2024-09-25 Created: 2024-09-25 Last updated: 2025-11-21 Bibliographically approved
3. Efficient Flow Matching using Latent Variables
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside on a low-dimensional manifold. To this end, we present Latent-CFM, which provides efficient training strategies by conditioning on features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that Latent-CFM exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.

National Category
Computer graphics and computer vision
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-373175 (URN)
10.48550/arXiv.2505.04486 (DOI)
Note

QC 20251124

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-24 Bibliographically approved
4. Large deviations for interacting particle dynamics for finding mixed equilibria in zero-sum games
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Finding equilibrium points in continuous minmax games has become a key problem within machine learning, in part due to its connection to the training of generative adversarial networks and reinforcement learning. Because of existence and robustness issues, recent developments have shifted from pure equilibria to focusing on mixed equilibrium points. In this work we consider a method for finding mixed equilibria in two-layer zero-sum games based on entropic regularisation, where the two competing strategies are represented by two sets of interacting particles. We show that the sequence of empirical measures of the particle system satisfies a large deviation principle as the number of particles grows to infinity, and how this implies convergence of the empirical measure and the associated Nikaidô-Isoda error, complementing existing law of large numbers results. 

National Category
Probability Theory and Statistics
Research subject
Mathematics
Identifiers
urn:nbn:se:kth:diva-373182 (URN)
10.48550/arXiv.2206.15177 (DOI)
Note

QC 20251124

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-24 Bibliographically approved
5. Large deviations for scaled families of Schrödinger bridges with reflection
(English) Manuscript (preprint) (Other academic)
Abstract [en]

In this paper, we show a large deviation principle for certain sequences of static Schrödinger bridges, typically motivated by a scale parameter decreasing towards zero, extending existing large deviation results to cover a wider range of reference processes. Our results provide a theoretical foundation for studying convergence of such Schrödinger bridges to their limiting optimal transport plans. Within generative modeling, Schrödinger bridges, or entropic optimal transport problems, constitute a prominent class of methods, in part because of their computational feasibility in high-dimensional settings. Recently, Bernton et al. established a large deviation principle, in the small-noise limit, for fixed-cost entropic optimal transport problems. In this paper, we address an open problem posed by Bernton et al. and extend their results to hold for Schrödinger bridges associated with certain sequences of more general reference measures with enough regularity, in a similar small-noise limit. These can be viewed as sequences of entropic optimal transport plans with non-fixed cost functions. Using a detailed analysis of the associated Skorokhod maps and transition densities, we show that the new large deviation results cover Schrödinger bridges where the reference process is a reflected diffusion on bounded convex domains, corresponding to recently introduced model choices in the generative modeling literature.

National Category
Probability Theory and Statistics
Research subject
Mathematics
Identifiers
urn:nbn:se:kth:diva-373183 (URN)
10.48550/arXiv.2506.03999 (DOI)
Note

QC 20251124

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-24 Bibliographically approved
6. A weak convergence approach to the large deviations of the dynamic Schrödinger problem
(English) Manuscript (preprint) (Other academic)
Abstract [en]

In this paper, we consider the large deviations for dynamical Schrödinger problems, using the variational approach developed by Dupuis, Ellis, Budhiraja, and others. Recent results on scaled families of Schrödinger problems, in particular by Bernton, Ghosal, and Nutz, and the authors, have established large deviation principles for the static problem. For the dynamic problem, only the case with a scaled Brownian motion reference process has been explored by Kato.

Here, we derive large deviations results using the variational approach, with the aim of going beyond the Brownian reference dynamics considered by Kato. Specifically, we develop a uniform Laplace principle for bridge processes conditioned on their endpoints. When combined with existing results for the static problem, this leads to a large deviation principle for the corresponding (dynamic) Schrödinger bridge. In addition to the specific results of the paper, our work puts such large deviation questions into the weak convergence framework, and we conjecture that the results can be extended to cover also more involved types of reference dynamics. Specifically, we provide an outlook on applying the result to reflected Schrödinger bridges.

National Category
Probability Theory and Statistics
Research subject
Mathematics
Identifiers
urn:nbn:se:kth:diva-373184 (URN)
10.48550/arXiv.2511.14757 (DOI)
Note

QC 20251124

Available from: 2025-11-20 Created: 2025-11-20 Last updated: 2025-11-24 Bibliographically approved

Open Access in DiVA

Kappa (1270 kB), 75 downloads
File information
File name: FULLTEXT02.pdf, File size: 1270 kB, Checksum: SHA-512
c7836abc03f6ade7909b42955106c818fc5abd7825a64137028012e846d8ddd4f43d1fbe17f4c193cb91a5e2b231966b8b19bdcf54ba68ae358585043566a59c
Type: fulltext, Mimetype: application/pdf

Search in DiVA

By author/editor
Nilsson, Viktor
By organisation
Probability, Mathematical Physics and Statistics
Probability Theory and Statistics

Total: 75 downloads
