KTH Publications

Publications (8 of 8)
Gamba, M. (2024). On Implicit Smoothness Regularization in Deep Learning. (Doctoral dissertation). Stockholm: KTH Royal Institute of Technology
On Implicit Smoothness Regularization in Deep Learning
2024 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

State of the art neural networks provide a rich class of function approximators, fueling the remarkable success of gradient-based deep learning on complex high-dimensional problems, ranging from natural language modeling to image and video generation and understanding. Modern deep networks enjoy sufficient expressive power to shatter common classification benchmarks, as well as interpolate noisy regression targets. At the same time, the same models are able to generalize well whilst perfectly fitting noisy training data, even in the absence of external regularization constraining model expressivity. Efforts towards making sense of the observed benign overfitting behaviour uncovered its occurrence in overparameterized linear regression as well as kernel regression, extending classical empirical risk minimization to the study of minimum norm interpolators. Existing theoretical understanding of the phenomenon identifies two key factors affecting the generalization ability of interpolating models. First, overparameterization – corresponding to the regime in which a model counts more parameters than the number of constraints imposed by the training sample – effectively reduces model variance in proximity of the training data. Second, the structure of the learner – which determines how patterns in the training data are encoded in the learned representation – controls the ability to separate signal from noise when attaining interpolation. Analyzing the above factors for deep finite-width networks respectively entails characterizing the mechanisms driving feature learning and norm-based capacity control in practical settings, thus posing a challenging open problem. The present thesis explores the problem of capturing effective complexity of finite-width deep networks trained in practice, through the lens of model function geometry, focusing on factors implicitly restricting model complexity.
First, model expressivity is contrasted with effective nonlinearity for models undergoing double descent, highlighting the constrained effective complexity afforded by overparameterization. Second, the geometry of interpolation is studied in the presence of noisy targets, observing robust interpolation over volumes of size controlled by model scale. Third, the observed behavior is formally tied to parameter-space curvature, connecting parameter-space geometry to that of the input space. Finally, the thesis concludes by investigating whether the findings translate to the context of self-supervised learning, relating the geometry of representations to downstream robustness, and highlighting trends in keeping with neural scaling laws. The present work isolates input-space smoothness as a key notion for characterizing the effective complexity of model functions expressed by overparameterized deep networks.

Abstract [sv]

State-of-the-art neural networks offer a rich class of function approximators, fueling the remarkable progress of gradient-based deep learning on complex high-dimensional problems, ranging from natural language modeling to image and video generation and understanding. Modern deep networks have sufficient expressive power to shatter common classification benchmarks, as well as to interpolate noisy regression targets. The same models can generalize well while fitting noisy training data perfectly, even in the absence of external regularization constraining the model's expressivity. Efforts to understand the observed so-called benign overfitting behaviour have demonstrated its occurrence in overparameterized linear regression as well as in kernel regression, extending classical empirical risk minimization to the study of minimum-norm interpolators. Existing theoretical understanding of the phenomenon identifies two key factors affecting the generalization ability of interpolating models. First, overparameterization – corresponding to the regime in which a model has more parameters than the number of constraints imposed by the training sample – effectively reduces model variance in the vicinity of the training data. Second, the structure of the learner – which determines how patterns in the training data are encoded in the learned representation – controls the ability to separate signal from noise when interpolation is attained. Analyzing the above factors for deep finite-width networks entails characterizing the mechanisms driving feature learning and norm-based capacity control in practical settings, which constitutes a challenging open problem. The present thesis explores the problem of capturing the effective complexity of finite-width deep networks trained in practice, viewed through the lens of model function geometry, with a focus on factors that implicitly constrain model complexity.
First, model expressivity is contrasted with effective nonlinearity for models undergoing so-called double descent, highlighting the constrained effective complexity afforded by overparameterization. Second, the geometry of interpolation is studied in the presence of noisy targets, observing robust interpolation over volumes of sizes determined by model scale. Third, the observed behavior is formally tied to the curvature of parameter space, connecting parameter-space geometry to that of the input space. Finally, the thesis concludes by investigating whether the findings translate to the context of self-supervised learning, relating the geometry of representations to downstream robustness, and highlighting trends in line with neural scaling laws. The present work isolates input-space smoothness as a key notion for characterizing the effective complexity of model functions expressed by overparameterized deep networks.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2024. p. 94
Series
TRITA-EECS-AVL ; 2024:80
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-354917 (URN) 978-91-8106-077-5 (ISBN)
Public defence
2024-11-07, https://kth-se.zoom.us/j/62717697317, Kollegiesalen, Brinellvägen 6, Stockholm, 15:00 (English)
Opponent
Supervisors
Note

QC 20241017

Available from: 2024-10-17 Created: 2024-10-17 Last updated: 2025-12-03 Bibliographically approved
Gamba, M., Englesson, E., Björkman, M. & Azizpour, H. (2023). Deep Double Descent via Smooth Interpolation. Transactions on Machine Learning Research, 2023(4)
Deep Double Descent via Smooth Interpolation
2023 (English) In: Transactions on Machine Learning Research, E-ISSN 2835-8856, Vol. 2023, no 4. Article in journal (Refereed) Published
Abstract [en]

The ability of overparameterized deep networks to interpolate noisy data, while at the same time showing good generalization performance, has recently been characterized in terms of the double descent curve for the test error. Common intuition from polynomial regression suggests that overparameterized networks are able to sharply interpolate noisy data, without considerably deviating from the ground-truth signal, thus preserving generalization ability. At present, a precise characterization of the relationship between interpolation and generalization for deep networks is missing. In this work, we quantify the sharpness of fit of the training data interpolated by neural network functions, by studying the loss landscape w.r.t. the input variable locally around each training point, over volumes around cleanly- and noisily-labelled training samples, as we systematically increase the number of model parameters and training epochs. Our findings show that loss sharpness in the input space follows both model- and epoch-wise double descent, with worse peaks observed around noisy labels. While small interpolating models sharply fit both clean and noisy data, large interpolating models express a smooth loss landscape, where noisy targets are predicted over large volumes around training data points, in contrast to existing intuition.
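The kind of input-space sharpness probed above (the loss landscape with respect to the input, sampled over a small volume around a training point) can be sketched numerically. The toy two-layer ReLU network, the `input_sharpness` helper, and the radius and sample counts below are illustrative assumptions, not the paper's actual models or protocol:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-layer ReLU network; sizes are illustrative only.
W1 = rng.standard_normal((16, 4)) * 0.5
b1 = np.zeros(16)
w2 = rng.standard_normal(16) * 0.5

def model(x):
    return np.maximum(W1 @ x + b1, 0.0) @ w2

def squared_loss(x, y):
    return 0.5 * (model(x) - y) ** 2

def input_sharpness(x, y, radius=0.1, n_samples=64):
    """Average loss increase over random perturbations on a sphere of the
    given radius around x -- a crude proxy for input-space loss sharpness."""
    base = squared_loss(x, y)
    deltas = rng.standard_normal((n_samples, x.size))
    deltas *= radius / np.linalg.norm(deltas, axis=1, keepdims=True)
    perturbed = np.array([squared_loss(x + d, y) for d in deltas])
    return float(np.mean(perturbed - base))

# At an interpolated point (y = model(x0)) the base loss is zero, so the
# measure reduces to the mean loss over the surrounding volume.
x0 = rng.standard_normal(4)
print(input_sharpness(x0, model(x0)))
```

Sweeping this measure over model sizes and training epochs, separately for clean and noisy labels, is the shape of the experiment the abstract describes.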

Place, publisher, year, edition, pages
Transactions on Machine Learning Research (TMLR), 2023
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-346450 (URN) 2-s2.0-86000152632 (Scopus ID)
Note

QC 20250320

Available from: 2024-05-15 Created: 2024-05-15 Last updated: 2025-03-20 Bibliographically approved
Gamba, M., Azizpour, H. & Björkman, M. (2023). On the Lipschitz Constant of Deep Networks and Double Descent. In: BMVA (Ed.), Proceedings 34th British Machine Vision Conference 2023. Paper presented at 34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023. Institute of Electrical and Electronics Engineers (IEEE)
On the Lipschitz Constant of Deep Networks and Double Descent
2023 (English) In: Proceedings 34th British Machine Vision Conference 2023 / [ed] BMVA, Institute of Electrical and Electronics Engineers (IEEE), 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors - namely loss landscape curvature and distance of parameters from initialization - respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novel insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.
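An empirical Lipschitz constant of the kind studied above can be estimated as the largest ratio of output change to input change over sampled pairs, which lower-bounds the true Lipschitz constant on the sampled region. The `empirical_lipschitz` helper and the linear sanity-check function below are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_lipschitz(f, xs):
    """Largest |f(x) - f(y)| / ||x - y|| over all sampled pairs: a lower
    bound on the true Lipschitz constant of f over the sampled region."""
    best = 0.0
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            dx = np.linalg.norm(xs[i] - xs[j])
            if dx > 1e-12:
                best = max(best, abs(f(xs[i]) - f(xs[j])) / dx)
    return best

# Sanity check on a linear map, whose true Lipschitz constant is ||a|| = 5.
a = np.array([3.0, 4.0])
f = lambda x: float(a @ x)
xs = rng.standard_normal((50, 2))
est = empirical_lipschitz(f, xs)
print(est)  # approaches 5.0 from below as more pairs are sampled
```

Tracking such an estimate for trained networks across model sizes, against the test error, is the shape of the study the abstract describes.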

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2023
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-348454 (URN) 10.1109/cdc51059.2022.9993136 (DOI) 000948128102130 () 2-s2.0-85147000098 (Scopus ID)
Conference
34th British Machine Vision Conference 2023, BMVC 2023, Aberdeen, UK, November 20-24, 2023
Note

Part of ISBN 978-166546761-2

QC 20240709

Available from: 2024-07-05 Created: 2024-07-05 Last updated: 2026-02-19 Bibliographically approved
Gamba, M., Azizpour, H. & Björkman, M. (2023). On the Lipschitz Constant of Deep Networks and Double Descent. In: 34th British Machine Vision Conference, BMVC 2023: . Paper presented at 34th British Machine Vision Conference, BMVC 2023, Aberdeen, United Kingdom of Great Britain and Northern Ireland, Nov 20 2023 - Nov 24 2023. British Machine Vision Association, BMVA
On the Lipschitz Constant of Deep Networks and Double Descent
2023 (English) In: 34th British Machine Vision Conference, BMVC 2023, British Machine Vision Association, BMVA, 2023. Conference paper, Published paper (Refereed)
Abstract [en]

Existing bounds on the generalization error of deep networks assume some form of smooth or bounded dependence on the input variable, falling short of investigating the mechanisms controlling such factors in practice. In this work, we present an extensive experimental study of the empirical Lipschitz constant of deep networks undergoing double descent, and highlight non-monotonic trends strongly correlating with the test error. Building a connection between parameter-space and input-space gradients for SGD around a critical point, we isolate two important factors - namely loss landscape curvature and distance of parameters from initialization - respectively controlling optimization dynamics around a critical point and bounding model function complexity, even beyond the training data. Our study presents novel insights on implicit regularization via overparameterization, and effective model complexity for networks trained in practice.

Place, publisher, year, edition, pages
British Machine Vision Association, BMVA, 2023
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-377498 (URN) 2-s2.0-105029482985 (Scopus ID)
Conference
34th British Machine Vision Conference, BMVC 2023, Aberdeen, United Kingdom of Great Britain and Northern Ireland, Nov 20 2023 - Nov 24 2023
Note

Not duplicate with diva 1882587

QC 20260302

Available from: 2026-03-02 Created: 2026-03-02 Last updated: 2026-03-02 Bibliographically approved
Gamba, M., Chmielewski-Anders, A., Sullivan, J., Azizpour, H. & Björkman, M. (2022). Are All Linear Regions Created Equal?. In: Camps-Valls, G Ruiz, FJR Valera, I (Ed.), Proceedings 25th International Conference on Artificial Intelligence and Statistics, AISTATS 2022: . Paper presented at 25th International Conference on Artificial Intelligence and Statistics, AISTATS 2022, Virtual, Online, MAR 28-30, 2022. ML Research Press, 151
Are All Linear Regions Created Equal?
2022 (English) In: Proceedings 25th International Conference on Artificial Intelligence and Statistics, AISTATS 2022 / [ed] Camps-Valls, G., Ruiz, F. J. R., Valera, I., ML Research Press, 2022, Vol. 151. Conference paper, Published paper (Refereed)
Abstract [en]

The number of linear regions has been studied as a proxy for the complexity of ReLU networks. However, the empirical success of network compression techniques like pruning and knowledge distillation suggests that, in the overparameterized setting, linear regions density might fail to capture the effective nonlinearity. In this work, we propose an efficient algorithm for discovering linear regions and use it to investigate the effectiveness of density in capturing the nonlinearity of trained VGGs and ResNets on CIFAR-10 and CIFAR-100. We contrast the results with a more principled nonlinearity measure based on function variation, highlighting the shortcomings of linear regions density. Furthermore, our measure of nonlinearity clearly correlates with model-wise deep double descent, connecting reduced test error with reduced nonlinearity and increased local similarity of linear regions.
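Linear regions of a ReLU network can be identified with activation patterns: each input falls in the region determined by which units are active. A crude way to probe region density, sketched below, is to enumerate the distinct patterns seen along a line segment in input space. The tiny one-hidden-layer network and the dense-sampling approach are illustrative, not the paper's proposed algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative one-hidden-layer ReLU net; each distinct activation pattern
# corresponds to a distinct linear region of the network function.
W = rng.standard_normal((32, 2))
b = rng.standard_normal(32) * 0.1

def activation_pattern(x):
    return tuple((W @ x + b > 0).astype(int))

def regions_on_segment(x0, x1, n_steps=1000):
    """Count distinct linear regions crossed on the segment from x0 to x1
    by sampling activation patterns densely along it (may undercount)."""
    patterns = {activation_pattern(x0 + t * (x1 - x0))
                for t in np.linspace(0.0, 1.0, n_steps)}
    return len(patterns)

n = regions_on_segment(np.array([-2.0, -2.0]), np.array([2.0, 2.0]))
print(n)
```

For one hidden layer, each unit's hyperplane crosses the segment at most once, so at most 33 patterns can appear here; the count per unit length is the "density" that the abstract argues is an imperfect complexity proxy.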

Place, publisher, year, edition, pages
ML Research Press, 2022
Series
Proceedings of Machine Learning Research, ISSN 2640-3498
National Category
Computer Sciences Computer Systems
Identifiers
urn:nbn:se:kth:diva-320995 (URN) 000841852301002 () 2-s2.0-85163053252 (Scopus ID)
Conference
25th International Conference on Artificial Intelligence and Statistics, AISTATS 2022, Virtual, Online, MAR 28-30, 2022
Note

QC 20221104

Available from: 2022-11-04 Created: 2022-11-04 Last updated: 2024-10-17 Bibliographically approved
Gamba, M., Azizpour, H., Carlsson, S. & Björkman, M. (2019). On the geometry of rectifier convolutional neural networks. In: Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019: . Paper presented at 17th IEEE/CVF International Conference on Computer Vision Workshop, ICCVW 2019, 27 October 2019 through 28 October 2019 (pp. 793-797). Institute of Electrical and Electronics Engineers Inc.
On the geometry of rectifier convolutional neural networks
2019 (English) In: Proceedings - 2019 International Conference on Computer Vision Workshop, ICCVW 2019, Institute of Electrical and Electronics Engineers Inc., 2019, p. 793-797. Conference paper, Published paper (Refereed)
Abstract [en]

While recent studies have shed light on the expressivity, complexity and compositionality of convolutional networks, the real inductive bias of the family of functions reachable by gradient descent on natural data is still unknown. By exploiting symmetries in the preactivation space of convolutional layers, we present preliminary empirical evidence of regularities in the preimage of trained rectifier networks, in terms of arrangements of polytopes, and relate it to the nonlinear transformations applied by the network to its input.

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers Inc., 2019
Keywords
Convolutional networks, Deep learning, Geometry, Preimage, Understanding, Computer vision, Convolution, Gradient methods, Rectifying circuits, Compositionality, Gradient descent, Inductive bias, Non-linear transformations, Preimages, Convolutional neural networks
National Category
Robotics and automation
Identifiers
urn:nbn:se:kth:diva-274163 (URN) 10.1109/ICCVW.2019.00106 (DOI) 000554591600099 () 2-s2.0-85082492932 (Scopus ID)
Conference
17th IEEE/CVF International Conference on Computer Vision Workshop, ICCVW 2019, 27 October 2019 through 28 October 2019
Note

QC 20200622

Part of ISBN 9781728150239

Available from: 2020-06-22 Created: 2020-06-22 Last updated: 2025-02-09 Bibliographically approved
Gamba, M., Ghosh, A., Agrawal, K. K., Richards, B., Azizpour, H. & Björkman, M. Different Faces of Model Scaling in Supervised and Self-Supervised Learning.
Different Faces of Model Scaling in Supervised and Self-Supervised Learning
(English) Manuscript (preprint) (Other academic)
Abstract [en]

The quality of the representations learned by neural networks depends on several factors, including the loss function, learning algorithm, and model architecture. In this work, we use information geometric measures to assess representation quality in a principled manner. We demonstrate that the sensitivity of learned representations to input perturbations, measured by the spectral norm of the feature Jacobian, provides valuable information about downstream generalization. On the other hand, measuring the coefficient of spectral decay observed in the eigenspectrum of the feature covariance provides insights into the global representation geometry. First, we empirically establish an equivalence between these notions of representation quality and show that they are inversely correlated. Second, our analysis reveals varying roles of scaling model size in improving generalization. Increasing model width leads to higher discriminability and relatively reduced smoothness in the self-supervised regime, consistent with the underparameterized regime of supervised learning. Interestingly, we report no observable double descent phenomenon in SSL with non-contrastive objectives for commonly used parameterization regimes, which opens up new opportunities for tight asymptotic analysis. Taken together, our results provide a loss-aware characterization of the differing roles of model scaling in supervised and self-supervised learning.
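The two measures named above can be illustrated on a linear feature map, where the Jacobian is constant and both quantities are easy to compute: the spectral norm is the largest singular value of the weight matrix, and the coefficient of spectral decay can be read off as the slope of a log-log fit to the feature covariance eigenspectrum. The dimensions and the simple power-law fit below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a linear feature map f(x) = Wx the Jacobian is W everywhere, so its
# spectral norm is just the largest singular value of W.
W = rng.standard_normal((16, 16))
jacobian_spectral_norm = float(np.linalg.svd(W, compute_uv=False)[0])

# Coefficient of spectral decay: negated slope of log-eigenvalue versus
# log-rank for the feature covariance eigenspectrum (a power-law fit).
X = rng.standard_normal((1000, 16))      # toy inputs with identity covariance
feats = X @ W.T
cov = np.cov(feats, rowvar=False)
eigs = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending eigenspectrum
ranks = np.arange(1, len(eigs) + 1)
decay_coeff = float(-np.polyfit(np.log(ranks), np.log(eigs), 1)[0])

print(jacobian_spectral_norm, decay_coeff)
```

For deep encoders the Jacobian varies with the input, so the spectral norm would be estimated at sampled inputs rather than read off a single matrix; the covariance-eigenspectrum fit carries over unchanged.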

National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-354880 (URN)
Note

QC 20241016

Available from: 2024-10-16 Created: 2024-10-16 Last updated: 2024-10-17 Bibliographically approved
Gamba, M., Agrawal, K. K., Ghosh, A., Richards, B., Azizpour, H. & Björkman, M. When Does Self-Supervised Pre-Training Yield Robust Representations?
When Does Self-Supervised Pre-Training Yield Robust Representations?
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Self-Supervised Learning (SSL) provides a powerful class of learning algorithms for extracting representations of unlabelled data. A common learning paradigm relies on generating multiple views of the training data by perturbing inputs with data augmentation, effectively enforcing the representation to attain invariance to certain input perturbations. While encoding invariance in this way has been shown to reliably improve downstream performance, its impact on Out of Distribution (OOD) generalization is underexplored. In particular, invariance-based learning objectives enforce low feature variation under selected input perturbations, which is a fundamental desideratum when dealing with downstream distribution shifts. Building on this connection, this work explores OOD robustness of SSL representations when data is corrupted with noise of increasing intensity, under different model scales and dataset sizes. Strikingly, our experiments suggest that, for a fixed training set, increasing encoder capacity consistently improves in-distribution performance, whereas OOD robustness plateaus. Furthermore, increasing training set size either virtually (via data augmentation) or by increasing the number of unperturbed samples improves OOD robustness across all model scales, delaying the onset of the plateau. While increasing dataset size with unperturbed samples consistently improves downstream performance as well as robustness, data augmentation in the low-samples regime offers a strong alternative when acquiring unperturbed data is impractical.
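The evaluation protocol described above, with a fixed model scored on inputs corrupted by noise of increasing intensity, can be sketched on a toy problem. The 1-D data and nearest-mean classifier below are illustrative stand-ins for the paper's encoders and corruptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the protocol: a fixed classifier evaluated on inputs
# corrupted by Gaussian noise of increasing intensity.
means = np.array([-2.0, 2.0])              # class means
x_test = np.repeat(means, 100)             # clean test inputs
y_test = np.repeat(np.array([0, 1]), 100)  # their labels

def accuracy(x, y):
    # Nearest-mean decision rule, fixed in advance (no retraining per noise level).
    pred = (np.abs(x - means[1]) < np.abs(x - means[0])).astype(int)
    return float(np.mean(pred == y))

accs = {}
for sigma in [0.0, 1.0, 4.0]:
    noisy = x_test + sigma * rng.standard_normal(x_test.shape)
    accs[sigma] = accuracy(noisy, y_test)

print(accs)  # accuracy typically degrades as corruption intensity grows
```

The robustness curve the abstract studies is exactly this accuracy-versus-intensity sweep, repeated across encoder capacities and training-set sizes.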

National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-354881 (URN)
Note

QC 20241023

Available from: 2024-10-16 Created: 2024-10-16 Last updated: 2024-10-23 Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0002-0242-4419
