Tools and Methods for Distributed and Large-Scale Training of Deep Neural Networks
KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Software and Computer systems, SCS. ORCID iD: 0000-0001-7236-4637
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Deep Neural Networks (DNNs) have been at the forefront of recent breakthroughs in Machine Learning (ML) and Deep Learning (DL). DNNs are increasingly used in various tasks, from Earth observation and analysis of satellite images to medical diagnosis and smart chatbots. A major contributor to these advances has been the abundance of training data, computation resources, and frameworks that enable efficient training of ever-larger and more complex DNNs, within a paradigm referred to as distributed DL, and in particular, distributed training, which is the focus of this doctoral dissertation. In distributed training, the data and computation are distributed across several workers as opposed to single-host training in which both the data and computation reside and happen on a single worker. In this setting, distributed training can help overcome the limitations of single-host training, such as memory constraints, computational bottlenecks, and data availability.
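As a toy illustration of the data-parallel idea described above (each worker computes a gradient on its own data shard, and the gradients are averaged so all workers apply the same update), consider the following sketch; the model, data, and learning rate are hypothetical, and in practice the per-worker gradients are computed in parallel and combined with an all-reduce:

```python
# Toy sketch of one synchronous data-parallel step. Model: minimize (w*x - y)^2.

def local_gradient(w, shard):
    """Mean gradient of the squared error over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.1):
    grads = [local_gradient(w, s) for s in shards]    # parallel in practice
    avg_grad = sum(grads) / len(grads)                # all-reduce (average)
    return w - lr * avg_grad                          # identical update on every worker

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # y = 2x
shards = [data[:2], data[2:]]                             # two workers
w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards)
print(round(w, 3))  # -> 2.0
```

Because every worker applies the same averaged gradient, the result is identical to training on the full dataset on one host, which is what makes this scheme attractive when the data no longer fits a single machine.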

However, distributed training comes with a number of challenges that must be carefully addressed for a system to make efficient use of it. These challenges include, but are not limited to, the efficient distribution of computation and data across the workers; the presence of straggler workers in a cluster (workers that fall significantly behind the others in their computation step), especially in synchronous execution settings; and communication and synchronization among the workers. Addressing them requires a system that scales in both the computation and the data dimensions.
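The cost of stragglers under synchronous execution can be made concrete with a small back-of-the-envelope sketch: with a synchronization barrier, every step lasts as long as the slowest worker. The per-step times below are hypothetical.

```python
# Straggler effect under a synchronous barrier: step time = max over workers.

step_times = [1.0, 1.1, 0.9, 3.5]   # seconds per worker; the last is a straggler

synchronous_step = max(step_times)               # everyone waits at the barrier
ideal_step = sum(step_times) / len(step_times)   # average, if no one had to wait

print(synchronous_step)                          # 3.5
print(round(synchronous_step / ideal_step, 2))   # slowdown factor: 2.15
```

A single slow worker thus more than doubles the step time for the whole cluster here, which is why asynchronous or straggler-tolerant execution schemes are of interest.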

On the other hand, from a programming and usability point of view, using the distributed training paradigm typically requires knowledge of distributed computing principles, experience with distributed and data-intensive computing frameworks, and major changes to the code used for single-host training. Furthermore, as training a DNN involves several steps and stages (e.g., data preparation, hyperparameter tuning, and model training), it is desirable to reuse the computational results of one step in another (e.g., reusing weights learned during hyperparameter tuning trials to initialize the weights for the model training step) in order to reduce training time. Finally, when developing larger and more complex DNNs, we also need to understand the contribution of each design choice.
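The reuse idea mentioned above (carrying weights from hyperparameter tuning over to model training) can be sketched as follows; the trial records and field names are hypothetical, not any particular framework's API:

```python
# Hedged sketch: initialize model training from the weights of the best
# hyperparameter-tuning trial instead of starting from scratch.

trials = [
    {"trial_id": "t1", "val_accuracy": 0.71, "weights": {"w": 0.4}},
    {"trial_id": "t2", "val_accuracy": 0.83, "weights": {"w": 1.7}},
    {"trial_id": "t3", "val_accuracy": 0.78, "weights": {"w": 1.2}},
]

def init_from_best_trial(trials):
    best = max(trials, key=lambda t: t["val_accuracy"])
    return dict(best["weights"])  # copy, so training doesn't mutate the record

init_weights = init_from_best_trial(trials)
print(init_weights)  # {'w': 1.7}
```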

The contributions of this doctoral dissertation address the aforementioned challenges, and collectively optimize large-scale DNN training, making it more accessible, efficient, and computationally sustainable while reducing the redundancy in ML/DL workflows, and providing usable tools for conducting ablation studies. 

Abstract [sv]

Djupa neurala nätverk (DNNs) har varit i framkant av de senaste genombrotten inom maskininlärning (ML) och djupinlärning (DL). DNNs används i allt större utsträckning inom en rad olika områden, från jordobservation och analys av satellitbilder till medicinsk diagnostik och smarta chattbotar. En stor bidragande faktor till dessa framsteg är tillgången på stora mängder träningsdata, kraftfulla beräkningsresurser och ramverk som möjliggör effektiv träning av allt större och mer komplexa DNNs inom ett paradigm som kallas distribuerad DL. Inom detta område är distribuerad träning fokus för denna doktorsavhandling. I distribuerad träning fördelas data och beräkningar över flera arbetarnoder, till skillnad från träning på en enskild värd där både data och beräkningar hanteras av en enda nod. I denna kontext kan distribuerad träning bidra till att övervinna begränsningar såsom minnesbegränsningar, beräkningsflaskhalsar och begränsad datatillgång.

Distribuerad träning innebär dock flera utmaningar som måste hanteras noggrant för att säkerställa effektiv resursanvändning. Dessa utmaningar inkluderar, men är inte begränsade till, effektiv fördelning av beräkningar och data mellan noder, förekomsten av stragglers (arbetarnoder som hamnar efter i sina beräkningar jämfört med andra), särskilt i synkrona exekveringsmiljöer, samt kommunikation och synkronisering mellan noderna. För att systemet ska vara skalbart behöver det kunna hantera både ökande beräkningsbehov och större datamängder.

Ur ett programmerings- och användbarhetsperspektiv kräver distribuerad träning ofta djupgående kunskap om distribuerad beräkning och erfarenhet av dataintensiva ramverk. Dessutom innebär det ofta omfattande anpassningar av kod som används för träning på en enskild värd. Eftersom träning av en DNN innefattar flera steg och faser (t.ex. datapreparering, hyperparametertuning, modellträning etc.), vore det önskvärt att återanvända beräkningsresultat från olika steg (t.ex. vikter inlärda under hyperparametertuning för att initialisera modellträningen) för att förbättra träningseffektiviteten. Slutligen, vid utveckling av större och mer komplexa DNNs, är det också viktigt att förstå varje designvals inverkan.

Denna doktorsavhandling adresserar de ovan nämnda utmaningarna och optimerar storskalig DNN-träning genom att göra den mer tillgänglig, effektiv och beräkningsmässigt hållbar, samtidigt som redundansen i ML/DL-arbetsflöden minskas och användbara verktyg för ablationsstudier tillhandahålls.

Place, publisher, year, edition, pages
Stockholm: KTH Royal Institute of Technology, 2025, p. 47
Series
TRITA-EECS-AVL ; 2025:28
Keywords [en]
Distributed Deep Learning, Ablation Studies, Data-parallel Training, Deep Neural Networks, Systems for Machine Learning, Weight Initialization, Hyperparameter Optimization
Keywords [sv]
Distribuerad djupinlärning, Ablationsstudier, Dataparallell träning, Djupa neurala nätverk, System för maskininlärning, Viktinitialisering, Hyperparameteroptimering
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:kth:diva-360720
ISBN: 978-91-8106-214-4 (print)
OAI: oai:DiVA.org:kth-360720
DiVA, id: diva2:1942157
Public defence
2025-03-27, Zoom: https://kth-se.zoom.us/j/69403203069, Sal-A, Electrum, Kistagången 16, Stockholm, 09:00 (English)
Funder
EU, Horizon 2020, 825258; Vinnova, 2016–05193
Note

QC 20250304

Available from: 2025-03-04 Created: 2025-03-04 Last updated: 2025-03-10. Bibliographically approved
List of papers
1. Towards Distribution Transparency for Supervised ML With Oblivious Training Functions
2020 (English) Conference paper, Poster (with or without abstract) (Refereed)
Abstract [en]

Building and productionizing Machine Learning (ML) models is a process of interdependent steps of iterative code updates, including exploratory model design, hyperparameter tuning, ablation experiments, and model training. Industrial-strength ML involves doing this at scale, using many compute resources, and this requires rewriting the training code to account for distribution. The result is that moving from a single-host program to a cluster hinders iterative development of the software, as it would require multiple versions of the software to be maintained and kept consistent. In this paper, we introduce the distribution oblivious training function as an abstraction for ML development in Python, whereby developers can reuse the same training function when running a notebook on a laptop or performing scale-out hyperparameter search and distributed training on clusters. Programs written in our framework look like industry-standard ML programs as we factor out dependencies using best-practice programming idioms (such as functions to generate models and data batches). We believe that our approach takes a step towards unifying single-host and distributed ML development.
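The abstraction described above can be sketched roughly as follows: the user writes one training function with its dependencies factored out (generator functions for the model and data passed in as arguments), and the same function runs single-host or could be handed unchanged to a distributed launcher. The launcher mentioned in the comment is a stand-in, not the paper's actual API.

```python
# Sketch of a "distribution oblivious" training function (toy linear model).

def make_model():
    return {"w": 0.0}

def make_batches():
    return [[(1.0, 2.0)], [(2.0, 4.0)]]  # y = 2x

def train_fn(make_model, make_batches, lr=0.1, epochs=20):
    model = make_model()
    for _ in range(epochs):
        for batch in make_batches():
            g = sum(2 * (model["w"] * x - y) * x for x, y in batch) / len(batch)
            model["w"] -= lr * g
    return model

# Single-host execution:
local_model = train_fn(make_model, make_batches)

# The very same function could be submitted unchanged to a cluster, e.g.
# launcher.run(train_fn, make_model, make_batches)  # hypothetical launcher API
print(round(local_model["w"], 2))  # -> 2.0
```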

National Category
Computer Sciences; Networked, Parallel and Distributed Computing
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-360717 (URN)
Conference
Workshop on MLOps Systems, co-located with the Third Conference on Machine Learning and Systems (MLSys), March 2-4, 2020, Austin, TX, USA
Funder
EU, Horizon 2020, 825258
Note

QCR 20250303

Available from: 2025-02-28 Created: 2025-02-28 Last updated: 2025-03-04. Bibliographically approved
2. Maggy: Scalable Asynchronous Parallel Hyperparameter Search
2020 (English) In: Proceedings of the 1st Workshop on Distributed Machine Learning, Association for Computing Machinery, 2020, p. 28-33. Conference paper, Published paper (Refereed)
Abstract [en]

Running extensive experiments is essential for building Machine Learning (ML) models. Such experiments usually require iterative execution of many trials with varying run times. In recent years, Apache Spark has become the de facto standard for parallel data processing in industry, in which iterative processes are implemented within the bulk-synchronous parallel (BSP) execution model. The BSP approach is also being used to parallelize ML trials in Spark. However, the BSP task synchronization barriers prevent asynchronous execution of trials, which reduces the number of trials that can be run within a given computational budget. In this paper, we introduce Maggy, an open-source framework based on Spark, to execute ML trials asynchronously in parallel, with the ability to early-stop poorly performing trials. In our experiments, we compare Maggy with the BSP execution of parallel trials in Spark and show that, on a random hyperparameter search for a convolutional neural network on the Fashion-MNIST dataset, Maggy reduces the time required to execute a fixed number of trials by 33% to 58%, without any loss in the final model accuracy.
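The early-stopping idea behind this line of work can be modeled with a simplified, sequential sketch: trials report a metric after every epoch, and a trial is stopped as soon as its metric falls clearly below the best metric seen so far at that epoch. The schedule, curves, and stopping rule here are illustrative, not Maggy's actual policy.

```python
# Simplified model of early stopping across hyperparameter trials.

def run_trials(trial_curves, max_epochs, margin=0.2):
    """trial_curves: {trial_id: [metric_at_epoch_0, metric_at_epoch_1, ...]}"""
    best_so_far = {}    # epoch -> best metric observed at that epoch
    epochs_spent = {}
    for tid, curve in trial_curves.items():
        spent = 0
        for epoch in range(max_epochs):
            metric = curve[epoch]
            spent += 1
            best = best_so_far.get(epoch, metric)
            best_so_far[epoch] = max(best, metric)
            if metric < best - margin:   # clearly worse than the best: stop early
                break
        epochs_spent[tid] = spent
    return epochs_spent

curves = {
    "t1": [0.5, 0.6, 0.7, 0.8],
    "t2": [0.2, 0.3, 0.3, 0.3],   # poor trial: stopped after its first epoch
    "t3": [0.5, 0.65, 0.75, 0.85],
}
spent = run_trials(curves, max_epochs=4)
print(spent)  # {'t1': 4, 't2': 1, 't3': 4}
print(sum(spent.values()), "epochs instead of", 4 * len(curves))  # 9 instead of 12
```

Stopping the poor trial early frees its budget for additional trials, which is the effect the paper measures.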

Place, publisher, year, edition, pages
Association for Computing Machinery, 2020
Keywords
Scalable Hyperparameter Search, Machine Learning, Asynchronous Hyperparameter Optimization
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-287209 (URN)
10.1145/3426745.3431338 (DOI)
000709791500005 ()
2-s2.0-85097717704 (Scopus ID)
Conference
The 1st Workshop on Distributed Machine Learning (DistributedML'20)
Note

QC 20201207

Available from: 2020-12-04 Created: 2020-12-04 Last updated: 2025-03-04. Bibliographically approved
3. AutoAblation: Automated Parallel Ablation Studies for Deep Learning
2021 (English) In: EuroMLSys '21: Proceedings of the 1st Workshop on Machine Learning and Systems, Association for Computing Machinery, 2021, p. 55-61. Conference paper, Published paper (Refereed)
Abstract [en]

Ablation studies provide insights into the relative contribution of different architectural and regularization components to machine learning models' performance. In this paper, we introduce AutoAblation, a new framework for the design and parallel execution of ablation experiments. AutoAblation provides a declarative approach to defining ablation experiments on model architectures and training datasets, and enables the parallel execution of ablation trials. This reduces the execution time and allows more comprehensive experiments by exploiting larger amounts of computational resources. We show that AutoAblation can provide near-linear scalability by performing an ablation study on the modules of the Inception-v3 network trained on the TenGeoPSAR dataset.  
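The declarative approach described above can be sketched as follows: the user lists which components to ablate, and the framework expands the specification into one trial per ablated component plus a baseline, which can then be executed in parallel. The field names and spec shape are illustrative, not AutoAblation's actual API.

```python
# Sketch of a declarative ablation specification and its expansion into trials.

ablation_spec = {
    "base_model": "inception_v3",
    "ablate_layers": ["mixed_7a", "mixed_7b"],
    "ablate_features": ["channel_2"],
}

def expand_trials(spec):
    trials = [{"name": "baseline", "removed": None}]
    for layer in spec["ablate_layers"]:
        trials.append({"name": f"no_{layer}", "removed": ("layer", layer)})
    for feat in spec["ablate_features"]:
        trials.append({"name": f"no_{feat}", "removed": ("feature", feat)})
    return trials

trials = expand_trials(ablation_spec)
print([t["name"] for t in trials])
# ['baseline', 'no_mixed_7a', 'no_mixed_7b', 'no_channel_2']
```

Since the generated trials are independent of each other, they can be dispatched to workers in parallel, which is where the near-linear scalability reported in the abstract comes from.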

Place, publisher, year, edition, pages
Association for Computing Machinery, 2021
Keywords
Ablation Studies, Deep Learning, Feature Ablation, Model Ablation, Parallel Trial Execution
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-294424 (URN)
10.1145/3437984.3458834 (DOI)
000927844400008 ()
2-s2.0-85106034900 (Scopus ID)
Conference
The 1st Workshop on Machine Learning and Systems (EuroMLSys '21)
Funder
EU, Horizon 2020
Note

QC 20210527

Available from: 2021-05-17 Created: 2021-05-17 Last updated: 2025-03-04. Bibliographically approved
4. The Impact of Importance-Aware Dataset Partitioning on Data-Parallel Training of Deep Neural Networks
2023 (English) In: Distributed Applications and Interoperable Systems - 23rd IFIP WG 6.1 International Conference, DAIS 2023, Held as Part of the 18th International Federated Conference on Distributed Computing Techniques, DisCoTec 2023, Proceedings, Springer Nature, 2023, p. 74-89. Conference paper, Published paper (Refereed)
Abstract [en]

Deep neural networks used for computer vision tasks are typically trained on datasets consisting of thousands of images, called examples. Recent studies have shown that examples in a dataset are not of equal importance for model training and can be categorized based on quantifiable measures reflecting a notion of “hardness” or “importance”. In this work, we conduct an empirical study of the impact of importance-aware partitioning of the dataset examples across workers on the performance of data-parallel training of deep neural networks. Our experiments with CIFAR-10 and CIFAR-100 image datasets show that data-parallel training with importance-aware partitioning can perform better than vanilla data-parallel training, which is oblivious to the importance of examples. More specifically, the proper choice of the importance measure, partitioning heuristic, and the number of intervals for dataset repartitioning can improve the best accuracy of the model trained for a fixed number of epochs. We conclude that the parameters related to importance-aware data-parallel training, including the importance measure, number of warmup training epochs, and others defined in the paper, may be considered as hyperparameters of data-parallel model training.
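One hypothetical importance-aware partitioning heuristic of the kind studied here: sort examples by an importance score (e.g., per-example loss) and deal them round-robin, so every worker's shard spans the full range of importance. This is an illustration of the idea, not one of the specific heuristics evaluated in the paper.

```python
# Importance-aware partitioning: balanced shards via sorted round-robin dealing.

def importance_partition(examples, scores, n_workers):
    order = sorted(range(len(examples)), key=lambda i: scores[i], reverse=True)
    shards = [[] for _ in range(n_workers)]
    for rank, idx in enumerate(order):
        shards[rank % n_workers].append(examples[idx])
    return shards

examples = ["a", "b", "c", "d", "e", "f"]
scores   = [0.9, 0.1, 0.7, 0.3, 0.8, 0.2]   # hypothetical importance scores
shards = importance_partition(examples, scores, n_workers=2)
print(shards)  # [['a', 'c', 'f'], ['e', 'd', 'b']]
```

Each shard mixes high- and low-importance examples, in contrast to "vanilla" partitioning, which ignores the scores entirely.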

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Data-parallel training, Distributed deep learning, Example importance
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-334525 (URN)
10.1007/978-3-031-35260-7_5 (DOI)
001288526100005 ()
2-s2.0-85164268176 (Scopus ID)
Conference
23rd IFIP International Conference on Distributed Applications and Interoperable Systems, DAIS 2023, Lisbon, Portugal, Jun 19 2023 - Jun 23 2023
Note

QC 20230823

Available from: 2023-08-23 Created: 2023-08-23 Last updated: 2025-03-04. Bibliographically approved
5. Deep Neural Network Weight Initialization from Hyperparameter Tuning Trials
2024 (English) In: Neural Information Processing, Springer Nature, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Training of deep neural networks from scratch requires initialization of the neural network weights as a first step. Over the years, many policies and techniques for weight initialization have been proposed and widely used, including Kaiming initialization and different variants of random initialization. Another requirement for starting the training stage is to choose and set suitable hyperparameter values, which are usually obtained by performing several hyperparameter tuning trials. In this paper, we study the suitability of weight initialization using weights obtained from different epochs of hyperparameter tuning trials and compare it to Kaiming uniform (random) weight initialization for image classification tasks. Based on an experimental evaluation using ResNet-18, ResNet-152, and InceptionV3 models, and the CIFAR-10, CIFAR-100, Tiny ImageNet, and Food-101 datasets, we show that weight initialization from hyperparameter tuning trials can speed up the training of deep neural networks by up to 2x while maintaining or improving the best test accuracy of the trained models, compared to random initialization.
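The choice studied here can be sketched as follows: start training either from a Kaiming-uniform-style random initialization or from weights saved at some epoch of a hyperparameter-tuning trial. The checkpoint store and function names are hypothetical.

```python
import math
import random

def kaiming_uniform(fan_in, n, rng):
    # Kaiming (He) uniform initialization for ReLU networks draws each weight
    # from U(-b, b) with b = sqrt(6 / fan_in).
    b = math.sqrt(6.0 / fan_in)
    return [rng.uniform(-b, b) for _ in range(n)]

def init_weights(fan_in, n, trial_checkpoints=None, trial_id=None, epoch=None):
    if trial_checkpoints and trial_id in trial_checkpoints:
        return list(trial_checkpoints[trial_id][epoch])   # reuse tuned weights
    return kaiming_uniform(fan_in, n, random.Random(0))   # fall back to random

checkpoints = {"trial_7": {3: [0.12, -0.05, 0.33]}}  # hypothetical checkpoint store
w = init_weights(fan_in=128, n=3, trial_checkpoints=checkpoints,
                 trial_id="trial_7", epoch=3)
print(w)  # [0.12, -0.05, 0.33]
```

Which trial and which epoch to take the weights from is exactly the kind of question the paper's experiments address.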

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
weight initialization, deep neural network training, hyperparameter tuning, model training, deep learning
National Category
Computer Sciences; Artificial Intelligence
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-358848 (URN)
10.1007/978-981-96-6954-7_5 (DOI)
Conference
ICONIP: International Conference on Neural Information Processing, December 2-6, Auckland, New Zealand
Note

QC 20250303

Available from: 2025-02-28 Created: 2025-02-28 Last updated: 2025-07-01. Bibliographically approved
6. Utilizing Large Language Models for Ablation Studies in Machine Learning and Deep Learning
2025 (English) Conference paper, Published paper (Refereed)
Abstract [en]

In Machine Learning (ML) and Deep Learning (DL) research, ablation studies are typically performed to provide insights into the individual contribution of different building blocks and components of an ML/DL system (e.g., a deep neural network), as well as to justify that certain additions or modifications to an existing ML/DL system result in the claimed performance improvements. Although dedicated frameworks for performing ablation studies have been introduced in recent years, conducting such experiments still requires tedious, redundant work, typically involving the maintenance of nearly identical versions of code that correspond to different ablation trials. Inspired by the recent promising performance of Large Language Models (LLMs) in the generation and analysis of ML/DL code, in this paper we discuss the potential of LLMs as facilitators of ablation study experiments for scientific research projects that involve ML and DL models. We first discuss the different ways in which LLMs can be utilized for ablation studies and then present a prototype of a tool called AblationMage that leverages LLMs to semi-automate the overall process of conducting ablation study experiments. We showcase the usability of AblationMage through three experiments, including one in which we reproduce the ablation studies of a recently published applied DL paper.
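A minimal sketch of the general idea: instead of hand-maintaining near-identical code copies per trial, an LLM could be prompted to produce each ablation variant of an existing training script. The prompt wording and workflow below are hypothetical, not AblationMage's actual prompts.

```python
# Building a prompt that asks an LLM for one ablation variant of a script.

PROMPT_TEMPLATE = (
    "You are given a model training script.\n"
    "Produce a copy with the component '{component}' removed, changing "
    "nothing else.\n\n"
    "```python\n{code}\n```"
)

def build_ablation_prompt(code, component):
    return PROMPT_TEMPLATE.format(component=component, code=code)

script = "model = build_model(dropout=0.5, batch_norm=True)"
prompt = build_ablation_prompt(script, "dropout")
print("dropout" in prompt and script in prompt)  # True
```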

Place, publisher, year, edition, pages
ACM Digital Library, 2025
Keywords
Ablation Studies, Deep Learning, Feature Ablation, Model Ablation, Large Language Models
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-360719 (URN)
10.1145/3721146.3721957 (DOI)
001477868300025 ()
2-s2.0-105003634645 (Scopus ID)
Conference
The 5th Workshop on Machine Learning and Systems (EuroMLSys), co-located with the 20th European Conference on Computer Systems (EuroSys)
Funder
Vinnova, 2016–05193
Note

QC 20250303

Available from: 2025-02-28 Created: 2025-02-28 Last updated: 2025-07-01

Open Access in DiVA

summary (1280 kB), 283 downloads
File information
File name: SUMMARY01.pdf
File size: 1280 kB
Checksum (SHA-512):
601960673ef57a542102c4482ba8661ab014650c465730313940bfcb7da7be99f4e6183e2ea3335533d740b338352476ebe9aed36cf7bbb96817789cf4868c56
Type: summary
Mimetype: application/pdf

Authority records

Sheikholeslami, Sina
