Publications (10 of 43)
Sheikholeslami, S., Ghasemirahni, H., Payberah, A. H., Wang, T., Dowling, J. & Vlassov, V. (2025). Utilizing Large Language Models for Ablation Studies in Machine Learning and Deep Learning. Paper presented at the 5th Workshop on Machine Learning and Systems (EuroMLSys), co-located with the 20th European Conference on Computer Systems (EuroSys). ACM Digital Library.
2025 (English) Conference paper, Published paper (Refereed)
Abstract [en]

In Machine Learning (ML) and Deep Learning (DL) research, ablation studies are typically performed to provide insight into the individual contributions of the building blocks and components of an ML/DL system (e.g., a deep neural network), and to justify that proposed additions or modifications to an existing system do result in the claimed performance improvements. Although dedicated frameworks for performing ablation studies have been introduced in recent years, conducting such experiments still requires tedious, redundant work, typically involving the maintenance of nearly identical versions of code corresponding to the different ablation trials. Inspired by the recent promising performance of Large Language Models (LLMs) in generating and analyzing ML/DL code, in this paper we discuss the potential of LLMs as facilitators of ablation study experiments for scientific research projects that involve ML and DL models. We first discuss the different ways in which LLMs can be utilized for ablation studies, and then present a prototype tool called AblationMage that leverages LLMs to semi-automate the overall process of conducting ablation study experiments. We showcase the usability of AblationMage through three experiments, including one in which we reproduce the ablation studies of a recently published applied DL paper.
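
To make the workflow concrete, the following is a minimal sketch of LLM-assisted ablation-trial generation in the spirit of the abstract above. It is not the AblationMage implementation: `call_llm`, the prompt template, and the file layout are illustrative assumptions.

```python
# Sketch: generate one ablated training script per component via an LLM.
# `call_llm` is a placeholder for any chat-completion API; prompt wording
# and output naming are assumptions, not the paper's actual code.
from pathlib import Path

PROMPT = (
    "You are given a complete model training script.\n"
    "Produce a runnable variant in which the component '{component}' is "
    "removed or disabled, keeping everything else unchanged.\n\n"
    "--- SCRIPT ---\n{script}"
)

def call_llm(prompt: str) -> str:
    """Placeholder: route to the chat-completion provider of your choice."""
    raise NotImplementedError

def generate_ablation_trials(script_path: str, components: list[str]) -> list[Path]:
    """Write one ablated copy of the training script per listed component."""
    script = Path(script_path).read_text()
    trial_paths = []
    for component in components:
        variant = call_llm(PROMPT.format(component=component, script=script))
        out = Path(f"trial_ablate_{component}.py")
        out.write_text(variant)          # each trial is a self-contained script
        trial_paths.append(out)
    return trial_paths

# Hypothetical usage: ablate two blocks of a CNN training script.
# generate_ablation_trials("train.py", ["dropout", "batch_norm"])
```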

Place, publisher, year, edition, pages
ACM Digital Library, 2025
Keywords
Ablation Studies, Deep Learning, Feature Ablation, Model Ablation, Large Language Models
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-360719 (URN), 10.1145/3721146.3721957 (DOI), 001477868300025 (ISI), 2-s2.0-105003634645 (Scopus ID)
Conference
The 5th Workshop on Machine Learning and Systems (EuroMLSys), co-located with the 20th European Conference on Computer Systems (EuroSys)
Funder
Vinnova, 2016–05193
Note

QC 20250303

Available from: 2025-02-28. Created: 2025-02-28. Last updated: 2025-07-01.
Sheikholeslami, S., Wang, T., Payberah, A. H., Dowling, J. & Vlassov, V. (2024). Deep Neural Network Weight Initialization from Hyperparameter Tuning Trials. In: Neural Information Processing. Paper presented at ICONIP: International Conference on Neural Information Processing, December 2-6, Auckland, New Zealand. Springer Nature.
2024 (English) In: Neural Information Processing, Springer Nature, 2024. Conference paper, Published paper (Refereed)
Abstract [en]

Training of deep neural networks from scratch requires initialization of the neural network weights as a first step. Over the years, many policies and techniques for weight initialization have been proposed and widely used, including Kaiming initialization and different variants of random initialization. On the other hand, another requirement for starting the training stage is to choose and set suitable hyperparameter values, which are usually obtained by performing several hyperparameter tuning trials. In this paper, we study the suitability of weight initialization using weights obtained from different epochs of hyperparameter tuning trials and compare it to Kaiming uniform (random) weight initialization for image classification tasks. Based on an experimental evaluation using ResNet-18, ResNet-152, and InceptionV3 models, and CIFAR-10, CIFAR-100, Tiny ImageNet, and Food-101 datasets, we show that weight initialization from hyperparameter tuning trials can speed up the training of deep neural networks by up to 2x while maintaining or improving the best test accuracy of the trained models, when compared to random initialization.
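
As an illustration of the initialization policy studied above, here is a hedged PyTorch sketch that replaces random initialization with weights loaded from a hyperparameter tuning trial checkpoint. The checkpoint path and selection rule are assumptions, not the paper's code.

```python
# Sketch: initialize a model from a tuning-trial checkpoint instead of
# the default Kaiming-based random initialization.
import torch
from torchvision.models import resnet18

def init_from_tuning_trial(model: torch.nn.Module, ckpt_path: str) -> torch.nn.Module:
    """Overwrite the model's randomly initialized weights with trial weights."""
    state = torch.load(ckpt_path, map_location="cpu")
    model.load_state_dict(state)     # assumes the trial used the same architecture
    return model

model = resnet18(num_classes=10)     # constructor applies random (Kaiming) init
# Hypothetical checkpoint saved at some epoch of a tuning trial:
model = init_from_tuning_trial(model, "tuning_trials/best_trial_epoch3.pt")
# ...then run the usual training loop on CIFAR-10, CIFAR-100, etc.
```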

Place, publisher, year, edition, pages
Springer Nature, 2024
Keywords
weight initialization, deep neural network training, hyperparameter tuning, model training, deep learning
National Category
Computer Sciences; Artificial Intelligence
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-358848 (URN), 10.1007/978-981-96-6954-7_5 (DOI)
Conference
ICONIP: International Conference on Neural Information Processing, December 2-6, Auckland, New Zealand
Note

QC 20250303

Available from: 2025-02-28. Created: 2025-02-28. Last updated: 2025-07-01. Bibliographically approved.
de la Rua Martinez, J., Buso, F., Kouzoupis, A., Ormenisan, A. A., Niazi, S., Bzhalava, D., . . . Dowling, J. (2024). The Hopsworks Feature Store for Machine Learning. In: SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data. Paper presented at the 2024 International Conference on Management of Data, SIGMOD 2024, Santiago, Chile, Jun 9-15, 2024 (pp. 135-147). Association for Computing Machinery (ACM).
2024 (English) In: SIGMOD-Companion 2024 - Companion of the 2024 International Conference on Management of Data, Association for Computing Machinery (ACM), 2024, p. 135-147. Conference paper, Published paper (Refereed)
Abstract [en]

Data management is the most challenging aspect of building Machine Learning (ML) systems. ML systems can read large volumes of historical data when training models, but inference workloads are more varied, depending on whether it is a batch or online ML system. The feature store for ML has recently emerged as a single data platform for managing ML data throughout the ML lifecycle, from feature engineering to model training to inference. In this paper, we present the Hopsworks feature store for machine learning as a highly available platform for managing feature data with API support for columnar, row-oriented, and similarity search query workloads. We introduce and address the challenges that feature stores solve: feature reuse, organizing data transformations, and ensuring correct and consistent data across feature engineering, model training, and model inference. We present the engineering challenges in building high-performance query services for a feature store and show how Hopsworks outperforms existing cloud feature stores for training and online inference query workloads.
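
To illustrate the two query paths the abstract contrasts, here is a toy, in-memory sketch of a feature store serving columnar reads for training and row-oriented lookups for online inference. It is purely illustrative and is not the Hopsworks API.

```python
# Toy model of a feature store's offline (columnar) and online (row) reads.
import pandas as pd

class ToyFeatureStore:
    def __init__(self):
        self._groups: dict[str, pd.DataFrame] = {}

    def insert(self, name: str, df: pd.DataFrame, key: str) -> None:
        """Register a feature group, indexed by its entity key."""
        self._groups[name] = df.set_index(key)

    def read_offline(self, name: str, features: list[str]) -> pd.DataFrame:
        """Columnar read of full history, e.g. to build a training set."""
        return self._groups[name][features]

    def read_online(self, name: str, key_value) -> dict:
        """Low-latency lookup of the feature vector for one entity."""
        return self._groups[name].loc[key_value].to_dict()

fs = ToyFeatureStore()
fs.insert("transactions", pd.DataFrame(
    {"cust_id": [1, 2], "avg_amount": [42.0, 13.5], "n_tx_7d": [5, 2]}
), key="cust_id")
train_df = fs.read_offline("transactions", ["avg_amount", "n_tx_7d"])
online_vec = fs.read_online("transactions", 1)   # {'avg_amount': 42.0, 'n_tx_7d': 5}
```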

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Series
Proceedings of the ACM SIGMOD International Conference on Management of Data, ISSN 0730-8078
Keywords
Arrow Flight, DuckDB, feature store, MLOps, RonDB
National Category
Computer Sciences; Computer Systems
Identifiers
urn:nbn:se:kth:diva-348769 (URN), 10.1145/3626246.3653389 (DOI), 2-s2.0-85196429961 (Scopus ID)
Conference
2024 International Conference on Management of Data, SIGMOD 2024, Santiago, Chile, Jun 9-15, 2024
Note

QC 20240628

Part of ISBN 979-840070422-2

Available from: 2024-06-27. Created: 2024-06-27. Last updated: 2024-06-28. Bibliographically approved.
Chikafa, G., Sheikholeslami, S., Niazi, S., Dowling, J. & Vlassov, V. (2023). Cloud-native RStudio on Kubernetes for Hopsworks.
2023 (English) Manuscript (preprint) (Other academic)
Abstract [en]

To fully benefit from cloud computing, services are designed following the “multi-tenant” architectural model, which aims to maximize resource sharing among users. However, multi-tenancy introduces challenges of security, performance isolation, scaling, and customization. RStudio Server is an open-source Integrated Development Environment (IDE) for the R programming language, accessible over a web browser. We present the design and implementation of a multi-user distributed system on Hopsworks, a data-intensive AI platform, following the multi-tenant model, that provides RStudio as Software as a Service (SaaS). We use the popular cloud-native technologies Docker and Kubernetes to solve the problems of performance isolation, security, and scaling that arise in a multi-tenant environment. We further enable secure data sharing in RStudio Server instances to provide data privacy and allow collaboration among RStudio users. We integrate our system with Apache Spark, which can scale and handle Big Data processing workloads. We also provide a UI where users can supply custom configurations and have full control of their own RStudio Server instances. Our system was tested on a Google Cloud Platform cluster with four worker nodes, each with 30 GB of RAM. The tests showed that 44 RStudio servers, each with 2 GB of RAM, can run concurrently. The system can scale out to support potentially hundreds of concurrently running RStudio servers by adding more resources (CPUs and RAM) to the cluster.
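
As a rough illustration of the per-user isolation described above, the sketch below generates one Kubernetes pod manifest per tenant with explicit resource limits. The image name, port, namespace defaults, and limits are hypothetical; this is not the system's actual deployment code.

```python
# Sketch: one isolated RStudio Server pod per tenant, with CPU/RAM limits
# enforced by Kubernetes (the basis of performance isolation).
def rstudio_pod_manifest(user: str, ram: str = "2Gi", cpu: str = "1") -> str:
    """Render a minimal pod manifest for one user's RStudio Server."""
    return f"""\
apiVersion: v1
kind: Pod
metadata:
  name: rstudio-{user}
  labels: {{app: rstudio, user: {user}}}
spec:
  containers:
  - name: rstudio
    image: example/rstudio-server:latest   # hypothetical image
    ports: [{{containerPort: 8787}}]       # RStudio Server default port
    resources:
      limits: {{memory: {ram}, cpu: "{cpu}"}}   # per-tenant isolation
"""

# Hypothetical usage: pipe the manifest to `kubectl apply -f -`.
print(rstudio_pod_manifest("alice"))
```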

Keywords
Multi-tenancy, Cloud-native, Performance Isolation, Security, Scaling, Docker, Kubernetes, SaaS, RStudio, Hopsworks
National Category
Computer Sciences; Software Engineering
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-336693 (URN), 10.48550/arXiv.2307.09132 (DOI)
Note

QC 20230918

Available from: 2023-09-18. Created: 2023-09-18. Last updated: 2023-09-18. Bibliographically approved.
Sheikholeslami, S., Payberah, A. H., Wang, T., Dowling, J. & Vlassov, V. (2023). The Impact of Importance-Aware Dataset Partitioning on Data-Parallel Training of Deep Neural Networks. In: Distributed Applications and Interoperable Systems - 23rd IFIP WG 6.1 International Conference, DAIS 2023, Held as Part of the 18th International Federated Conference on Distributed Computing Techniques, DisCoTec 2023, Proceedings. Paper presented at the 23rd IFIP International Conference on Distributed Applications and Interoperable Systems, DAIS 2023, Lisbon, Portugal, Jun 19-23, 2023 (pp. 74-89). Springer Nature.
2023 (English) In: Distributed Applications and Interoperable Systems - 23rd IFIP WG 6.1 International Conference, DAIS 2023, Held as Part of the 18th International Federated Conference on Distributed Computing Techniques, DisCoTec 2023, Proceedings, Springer Nature, 2023, p. 74-89. Conference paper, Published paper (Refereed)
Abstract [en]

Deep neural networks used for computer vision tasks are typically trained on datasets consisting of thousands of images, called examples. Recent studies have shown that examples in a dataset are not of equal importance for model training and can be categorized based on quantifiable measures reflecting a notion of “hardness” or “importance”. In this work, we conduct an empirical study of the impact of importance-aware partitioning of the dataset examples across workers on the performance of data-parallel training of deep neural networks. Our experiments with CIFAR-10 and CIFAR-100 image datasets show that data-parallel training with importance-aware partitioning can perform better than vanilla data-parallel training, which is oblivious to the importance of examples. More specifically, the proper choice of the importance measure, partitioning heuristic, and the number of intervals for dataset repartitioning can improve the best accuracy of the model trained for a fixed number of epochs. We conclude that the parameters related to importance-aware data-parallel training, including the importance measure, number of warmup training epochs, and others defined in the paper, may be considered as hyperparameters of data-parallel model training.
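
The following sketch shows one possible importance-aware partitioning heuristic consistent with the description above: score each example, then deal examples to workers in a snake order so every worker receives a similar mix of hard and easy examples. The scoring function and heuristic are illustrative assumptions, not the paper's exact measures.

```python
# Sketch: balance per-worker shards by an example-importance score
# (here a stand-in such as running per-example loss).
import numpy as np

def partition_by_importance(scores: np.ndarray, n_workers: int) -> list[np.ndarray]:
    """Return per-worker index arrays balanced by importance score."""
    order = np.argsort(scores)[::-1]              # hardest examples first
    shards = [[] for _ in range(n_workers)]
    for rank, idx in enumerate(order):
        cycle, pos = divmod(rank, n_workers)
        # Snake order: 0,1,2,3, 3,2,1,0, 0,1,2,3, ... balances totals.
        worker = pos if cycle % 2 == 0 else n_workers - 1 - pos
        shards[worker].append(idx)
    return [np.array(s) for s in shards]

scores = np.random.rand(50_000)                   # placeholder importance scores
shards = partition_by_importance(scores, n_workers=4)
```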

Place, publisher, year, edition, pages
Springer Nature, 2023
Keywords
Data-parallel training, Distributed deep learning, Example importance
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-334525 (URN), 10.1007/978-3-031-35260-7_5 (DOI), 001288526100005 (ISI), 2-s2.0-85164268176 (Scopus ID)
Conference
23rd IFIP International Conference on Distributed Applications and Interoperable Systems, DAIS 2023, Lisbon, Portugal, Jun 19-23, 2023
Note

QC 20230823

Available from: 2023-08-23. Created: 2023-08-23. Last updated: 2025-03-04. Bibliographically approved.
Hagos, D. H., Kakantousis, T., Sheikholeslami, S., Wang, T., Vlassov, V., Payberah, A. H., . . . Dowling, J. (2022). Scalable Artificial Intelligence for Earth Observation Data Using Hopsworks. Remote Sensing, 14(8), Article ID 1889.
2022 (English) In: Remote Sensing, E-ISSN 2072-4292, Vol. 14, no. 8, article id 1889. Article in journal (Refereed) Published
Abstract [en]

This paper introduces the Hopsworks platform to the entire Earth Observation (EO) data community and the Copernicus programme. Hopsworks is a scalable data-intensive open-source Artificial Intelligence (AI) platform that was jointly developed by Logical Clocks and the KTH Royal Institute of Technology for building end-to-end Machine Learning (ML)/Deep Learning (DL) pipelines for EO data. It provides the full stack of services needed to manage the entire life cycle of data in ML. In particular, Hopsworks supports the development of horizontally scalable DL applications in notebooks and the operation of workflows to support those applications, including parallel data processing, model training, and model deployment at scale. To the best of our knowledge, this is the first work that demonstrates the services and features of the Hopsworks platform, which provide users with the means to build scalable end-to-end ML/DL pipelines for EO data, as well as support for the discovery and search for EO metadata. This paper serves as a demonstration and walkthrough of the stages of building a production-level model that includes data ingestion, data preparation, feature extraction, model training, model serving, and monitoring. To this end, we provide a practical example that demonstrates the aforementioned stages with real-world EO data and includes source code that implements the functionality of the platform. We also perform an experimental evaluation of two frameworks built on top of Hopsworks, namely Maggy and AutoAblation. We show that using Maggy for hyperparameter tuning results in roughly half the wall-clock time required to execute the same number of hyperparameter tuning trials using Spark while providing linear scalability as more workers are added. Furthermore, we demonstrate how AutoAblation facilitates the definition of ablation studies and enables the asynchronous parallel execution of ablation trials.
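
To illustrate the asynchronous parallel trial execution that the abstract credits for Maggy's speedup over synchronous, stage-based execution, here is a generic Python sketch using a process pool. It is not the Maggy API; the search space and metric are placeholders.

```python
# Sketch: run hyperparameter trials in parallel and consume results as they
# complete, rather than waiting for synchronous stages to finish.
import random
from concurrent.futures import ProcessPoolExecutor, as_completed

def train_trial(config: dict) -> float:
    """Stand-in for one training run; returns the metric to maximize."""
    return -((config["lr"] - 0.01) ** 2) + random.gauss(0, 1e-4)

def random_config() -> dict:
    """Sample a learning rate log-uniformly (placeholder search space)."""
    return {"lr": 10 ** random.uniform(-4, -1)}

if __name__ == "__main__":
    n_trials, results = 16, []
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(train_trial, random_config()) for _ in range(n_trials)}
        for fut in as_completed(futures):     # results arrive asynchronously
            results.append(fut.result())
    print(f"best metric: {max(results):.6f}")
```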

Place, publisher, year, edition, pages
MDPI AG, 2022
Keywords
Hopsworks, Copernicus, Earth Observation, machine learning, deep learning, artificial intelligence, model serving, big data, ablation studies, Maggy, ExtremeEarth
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-311886 (URN), 10.3390/rs14081889 (DOI), 000787403900001 (ISI), 2-s2.0-85129027995 (Scopus ID)
Note

QC 20220506

Available from: 2022-05-06. Created: 2022-05-06. Last updated: 2023-08-28. Bibliographically approved.
Armgarth, A., Pantzare, S., Arven, P., Lassnig, R., Jinno, H., Gabrielsson, E. O., . . . Berggren, M. (2021). A digital nervous system aiming toward personalized IoT healthcare. Scientific Reports, 11(1), Article ID 7757.
2021 (English) In: Scientific Reports, E-ISSN 2045-2322, Vol. 11, no. 1, article id 7757. Article in journal (Refereed) Published
Abstract [en]

Body area networks (BANs), cloud computing, and machine learning are platforms that can potentially enable advanced healthcare outside the hospital. By applying distributed sensors and drug delivery devices on/in our body and connecting them to such communication and decision-making technology, a system for remote diagnostics and therapy is achieved, with additional autoregulation capabilities. Challenges with such autarchic on-body healthcare schemes relate to integrity and safety, and to the interfacing and transduction of electronic signals into biochemical signals, and vice versa. Here, we report a BAN comprising flexible on-body organic bioelectronic sensors and actuators that utilizes two parallel pathways for communication and decision-making. Data recorded from strain sensors detecting body motion are both securely transferred to the cloud for machine learning and improved decision-making, and sent through the body using a secure body-coupled communication protocol to auto-actuate delivery of neurotransmitters, all within seconds. We conclude that highly stable and accurate sensing, from multiple sensors, is needed to enable robust decision-making and limit the frequency of retraining. The holistic platform resembles the self-regulatory properties of the nervous system, i.e., the ability to sense, communicate, decide, and react accordingly, thus operating as a digital nervous system.
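
The closed loop the abstract describes (sense, communicate, decide, actuate) can be sketched schematically as below. The threshold classifier and device callables are hypothetical stand-ins for the organic bioelectronic sensors, the cloud-side model, and body-coupled actuation.

```python
# Schematic sketch of a sense -> decide -> actuate loop; all numbers and
# interfaces are illustrative placeholders, not the reported system.
import random

def classify_motion(window: list[float]) -> bool:
    """Stand-in for the trained model: detect the target motion pattern."""
    return sum(window) / len(window) > 0.7        # toy decision threshold

def control_loop(read_sensor, actuate_delivery,
                 window_size: int = 10, max_windows: int = 3) -> None:
    """Sample a window of sensor readings, decide, and conditionally actuate."""
    for _ in range(max_windows):                  # bounded for the sketch
        window = [read_sensor() for _ in range(window_size)]
        if classify_motion(window):               # decision step
            actuate_delivery()                    # e.g. neurotransmitter release

control_loop(read_sensor=lambda: random.random(),
             actuate_delivery=lambda: print("actuate drug delivery"))
```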

Place, publisher, year, edition, pages
Nature Research, 2021
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-296428 (URN), 10.1038/s41598-021-87177-z (DOI), 000639562100077 (ISI), 33833303 (PubMedID), 2-s2.0-85104084403 (Scopus ID)
Note

QC 20210614

Available from: 2021-06-14. Created: 2021-06-14. Last updated: 2022-09-15. Bibliographically approved.
Sheikholeslami, S., Meister, M., Wang, T., Payberah, A. H., Vlassov, V. & Dowling, J. (2021). AutoAblation: Automated Parallel Ablation Studies for Deep Learning. In: EuroMLSys '21: Proceedings of the 1st Workshop on Machine Learning and Systems. Paper presented at the 1st Workshop on Machine Learning and Systems (EuroMLSys '21) (pp. 55-61). Association for Computing Machinery.
2021 (English) In: EuroMLSys '21: Proceedings of the 1st Workshop on Machine Learning and Systems, Association for Computing Machinery, 2021, p. 55-61. Conference paper, Published paper (Refereed)
Abstract [en]

Ablation studies provide insights into the relative contribution of different architectural and regularization components to machine learning models' performance. In this paper, we introduce AutoAblation, a new framework for the design and parallel execution of ablation experiments. AutoAblation provides a declarative approach to defining ablation experiments on model architectures and training datasets, and enables the parallel execution of ablation trials. This reduces the execution time and allows more comprehensive experiments by exploiting larger amounts of computational resources. We show that AutoAblation can provide near-linear scalability by performing an ablation study on the modules of the Inception-v3 network trained on the TenGeoPSAR dataset.  
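
As an illustration of declarative, parallel ablation trials in the spirit of AutoAblation, consider the sketch below. The spec schema and runner are assumptions for illustration, not AutoAblation's actual API.

```python
# Sketch: a declarative ablation spec expanded into independent trials
# that run in parallel, one per removed component.
from concurrent.futures import ProcessPoolExecutor

ABLATION_SPEC = {
    "base_model": "inception_v3",
    "ablate_layers": ["mixed_7a", "mixed_7b", "mixed_7c"],   # one trial each
    "ablate_features": ["channel_mean"],
}

def run_trial(dropped: str) -> tuple[str, float]:
    """Stand-in: train and evaluate the model with `dropped` removed."""
    accuracy = 0.9 - 0.01 * len(dropped) / 10    # placeholder metric
    return dropped, accuracy

if __name__ == "__main__":
    trials = ABLATION_SPEC["ablate_layers"] + ABLATION_SPEC["ablate_features"]
    with ProcessPoolExecutor() as pool:          # trials are independent
        for component, acc in pool.map(run_trial, trials):
            print(f"without {component}: accuracy={acc:.3f}")
```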

Place, publisher, year, edition, pages
Association for Computing Machinery, 2021
Keywords
Ablation Studies, Deep Learning, Feature Ablation, Model Ablation, Parallel Trial Execution
National Category
Computer Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:kth:diva-294424 (URN), 10.1145/3437984.3458834 (DOI), 000927844400008 (ISI), 2-s2.0-85106034900 (Scopus ID)
Conference
The 1st Workshop on Machine Learning and Systems (EuroMLSys '21)
Funder
EU, Horizon 2020
Note

QC 20210527

Available from: 2021-05-17. Created: 2021-05-17. Last updated: 2025-03-04. Bibliographically approved.
Ismail, M., Niazi, S., Sundell, M., Ronström, M., Haridi, S. & Dowling, J. (2020). Distributed Hierarchical File Systems strike back in the Cloud. In: 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS). Paper presented at the 40th IEEE International Conference on Distributed Computing Systems (ICDCS), Nov 29 - Dec 1, 2020, held online (pp. 820-830). Institute of Electrical and Electronics Engineers (IEEE).
2020 (English) In: 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), Institute of Electrical and Electronics Engineers (IEEE), 2020, p. 820-830. Conference paper, Published paper (Refereed)
Abstract [en]

Cloud service providers have aligned on availability zones as an important unit of failure and replication for storage systems. An availability zone (AZ) has independent power, networking, and cooling systems and consists of one or more data centers. Multiple AZs in close geographic proximity form a region that can support replicated low latency storage services that can survive the failure of one or more AZs. Recent reductions in inter-AZ latency have made synchronous replication protocols increasingly viable, instead of traditional quorum-based replication protocols. We introduce HopsFS-CL, a distributed hierarchical file system with support for high-availability (HA) across AZs, backed by AZ-aware synchronously replicated metadata and AZ-aware block replication. HopsFS-CL is a redesign of HopsFS, a version of HDFS with distributed metadata, and its design involved making replication protocols and block placement protocols AZ-aware at all layers of its stack: the metadata serving, the metadata storage, and block storage layers. In experiments on a real-world workload from Spotify, we show that HopsFS-CL, deployed in HA mode over 3 AZs, reaches 1.66 million ops/s, and has similar performance to HopsFS when deployed in a single AZ, while preserving the same semantics.
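
To illustrate AZ-aware block placement, here is a toy sketch that spreads a block's replicas across distinct availability zones so the block survives a full-AZ failure. The cluster model is hypothetical; this is not HopsFS-CL code.

```python
# Sketch: choose one replica node per distinct availability zone.
import random
from collections import defaultdict

def place_replicas(nodes: dict[str, str], n_replicas: int = 3) -> list[str]:
    """`nodes` maps node_id -> AZ; returns one node from each chosen AZ."""
    by_az = defaultdict(list)
    for node, az in nodes.items():
        by_az[az].append(node)
    if len(by_az) < n_replicas:
        raise ValueError("need nodes in at least n_replicas distinct AZs")
    chosen_azs = random.sample(list(by_az), n_replicas)   # distinct AZs
    return [random.choice(by_az[az]) for az in chosen_azs]

cluster = {"n1": "az1", "n2": "az1", "n3": "az2", "n4": "az2", "n5": "az3"}
print(place_replicas(cluster))   # e.g. ['n2', 'n3', 'n5'] -- one per AZ
```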

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2020
Series
IEEE International Conference on Distributed Computing Systems, ISSN 1063-6927
National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-299114 (URN), 10.1109/ICDCS47774.2020.00108 (DOI), 000667971400075 (ISI), 2-s2.0-85101968318 (Scopus ID)
Conference
40th IEEE International Conference on Distributed Computing Systems (ICDCS), Nov 29 - Dec 1, 2020, held online
Note

QC 20210803

Not duplicate with DiVA 1467134

Available from: 2021-08-03. Created: 2021-08-03. Last updated: 2022-06-25. Bibliographically approved.
Ismail, M., Niazi, S., Sundell, M., Ronström, M., Haridi, S. & Dowling, J. (2020). Distributed Hierarchical File Systems strike back in the Cloud. Paper presented at the 40th IEEE International Conference on Distributed Computing Systems, November 29 - December 1, 2020, Singapore.
2020 (English) Conference paper, Published paper (Refereed)
Abstract [en]

Cloud service providers have aligned on availability zones as an important unit of failure and replication for storage systems. An availability zone (AZ) has independent power, networking, and cooling systems and consists of one or more data centers. Multiple AZs in close geographic proximity form a region that can support replicated low latency storage services that can survive the failure of one or more AZs. Recent reductions in inter-AZ latency have made synchronous replication protocols increasingly viable, instead of traditional quorum-based replication protocols. We introduce HopsFS-CL, a distributed hierarchical file system with support for high-availability (HA) across AZs, backed by AZ-aware synchronously replicated metadata and AZ-aware block replication. HopsFS-CL is a redesign of HopsFS, a version of HDFS with distributed metadata, and its design involved making replication protocols and block placement protocols AZ-aware at all layers of its stack: the metadata serving, the metadata storage, and block storage layers. In experiments on a real-world workload from Spotify, we show that HopsFS-CL, deployed in HA mode over 3 AZs, reaches 1.66 million ops/s, and has similar performance to HopsFS when deployed in a single AZ, while preserving the same semantics.

National Category
Computer Systems
Identifiers
urn:nbn:se:kth:diva-280786 (URN)
Conference
40th IEEE International Conference on Distributed Computing Systems, November 29 - December 1, 2020, Singapore
Note

QC 20210120

Available from: 2020-09-14. Created: 2020-09-14. Last updated: 2022-06-25. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-9484-6714
