kth.se Publications
1 - 12 of 12
  • 1.
    Alistarh, Dan
    et al.
    IST Austria, Klosterneuburg, Austria.
    Hoefler, Torsten
    Swiss Fed Inst Technol, Zurich, Switzerland.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Khirirat, Sarit
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Konstantinov, Nikola
    IST Austria, Klosterneuburg, Austria.
    Renggli, Cedric
    Swiss Fed Inst Technol, Zurich, Switzerland.
    The Convergence of Sparsified Gradient Methods, 2018. In: Advances in Neural Information Processing Systems 31 (NIPS 2018) / [ed] Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; Garnett, R., Neural Information Processing Systems (NIPS), 2018, Vol. 31. Conference paper (Refereed)
    Abstract [en]

    Stochastic Gradient Descent (SGD) has become the standard tool for distributed training of massive machine learning models, in particular deep neural networks. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed to reduce the overheads of distribution. To date, gradient sparsification methods (where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally) are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis also reveals that these methods do require analytical conditions to converge well, justifying and complementing existing heuristics.
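    A minimal single-machine simulation of the mechanism described above (magnitude-based sparsification with local error accumulation) is sketched below. The least-squares problem, node count, sparsification level and step-size are illustrative assumptions, not taken from the paper, and full local gradients are used in place of stochastic ones for brevity.

    ```python
    import numpy as np

    def top_k(v, k):
        """Keep the k largest-magnitude entries of v, zero out the rest."""
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    # Illustrative data-parallel least-squares setup (assumed, not from the paper).
    rng = np.random.default_rng(0)
    n_nodes, n_samples, dim, k = 4, 200, 50, 5
    A = [rng.standard_normal((n_samples, dim)) for _ in range(n_nodes)]
    x_star = rng.standard_normal(dim)
    b = [a @ x_star + 0.01 * rng.standard_normal(n_samples) for a in A]

    x = np.zeros(dim)
    memory = [np.zeros(dim) for _ in range(n_nodes)]   # local error accumulators
    gamma = 0.02                                       # step-size (assumed)

    for _ in range(2000):
        update = np.zeros(dim)
        for i in range(n_nodes):
            grad = A[i].T @ (A[i] @ x - b[i]) / n_samples
            corrected = grad + memory[i]      # add back what was not sent earlier
            sparse = top_k(corrected, k)      # communicate only k components
            memory[i] = corrected - sparse    # accumulate the rest locally
            update += sparse
        x -= gamma * update / n_nodes
    print(np.linalg.norm(x - x_star))          # distance to the generating model
    ```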

  • 2.
    Berglund, Erik
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Khirirat, Sarit
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Wang, Xiaoyu
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Zeroth-Order Randomized Subspace Newton Methods, 2022. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 6002-6006. Conference paper (Refereed)
    Abstract [en]

    Zeroth-order methods have become important tools for solving problems where we have access only to function evaluations. However, zeroth-order methods that use only gradient approximations are n times slower than classical first-order methods for solving n-dimensional problems. To accelerate the convergence rate, this paper proposes the zeroth-order randomized subspace Newton (ZO-RSN) method, which estimates projections of the gradient and Hessian by random sketching and finite differences. This allows us to compute the Newton step in a lower-dimensional subspace, with small computational costs. We prove that ZO-RSN can attain lower iteration complexity than existing zeroth-order methods for strongly convex problems. Our numerical experiments show that ZO-RSN can perform black-box attacks under a more restrictive limit on the number of function queries than the state-of-the-art Hessian-aware zeroth-order method.
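    The sketch below illustrates the ingredients named in the abstract: a random sketch of the decision space, finite-difference estimates of the sketched gradient and Hessian, and a Newton-type step inside the subspace. The sketch distribution, difference stencil, regularization and all constants are assumptions for illustration; this is not the exact ZO-RSN recursion.

    ```python
    import numpy as np

    def zo_rsn_step(f, x, k, h=1e-4, reg=1e-6, rng=None):
        """One zeroth-order Newton-type step in a random k-dimensional subspace.

        Uses finite differences of f along the columns of a random sketch P to
        estimate P^T grad f(x) and P^T hess f(x) P, then takes a regularized
        Newton step inside the subspace (illustrative, not the exact ZO-RSN rule).
        """
        rng = rng or np.random.default_rng()
        P = rng.standard_normal((x.size, k)) / np.sqrt(k)
        fx = f(x)
        fp = np.array([f(x + h * P[:, i]) for i in range(k)])
        fm = np.array([f(x - h * P[:, i]) for i in range(k)])
        g = (fp - fm) / (2 * h)                                # sketched gradient
        H = np.zeros((k, k))
        for i in range(k):
            H[i, i] = (fp[i] - 2 * fx + fm[i]) / h ** 2        # sketched Hessian
            for j in range(i + 1, k):
                fij = f(x + h * P[:, i] + h * P[:, j])
                H[i, j] = H[j, i] = (fij - fp[i] - fp[j] + fx) / h ** 2
        d = np.linalg.solve(H + reg * np.eye(k), g)
        return x - P @ d

    # Usage on an assumed strongly convex quadratic.
    Q = np.diag(np.linspace(1.0, 10.0, 100))
    f = lambda x: 0.5 * x @ Q @ x
    x, rng = np.ones(100), np.random.default_rng(1)
    for _ in range(100):
        x = zo_rsn_step(f, x, k=10, rng=rng)
    print(f(x))
    ```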

  • 3.
    Khirirat, Sarit
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    First-Order Algorithms for Communication Efficient Distributed Learning, 2022. Doctoral thesis, monograph (Other academic)
    Abstract [en]

    Innovations in numerical optimization, statistics and high performance computing have enabled tremendous advances in machine learning algorithms, fuelling applications from natural language processing to autonomous driving. To deal with increasing data volumes, and to keep the training times of increasingly complex machine learning models reasonable, modern optimization algorithms distribute both data and computations over a large number of machines. However, these algorithms face challenges, from hardware and data heterogeneity (as different machines have different processors and data) to privacy and security issues (where data can be extracted from the transmitted parameters). Among these challenges, communication overhead constitutes a major performance bottleneck of the algorithms. Communicating millions of problem parameters between machines has been reported to consume up to 80% of the training time. To alleviate the communication bottleneck, we develop theory and strategies in this thesis to design communication-efficient optimization algorithms.

    In the first part, we provide unified analysis frameworks for first-order algorithms using direct or error-compensated compression. We first identify key characteristics of the compression errors induced by many of the most popular compression strategies in the literature. We then perform a unified analysis of the convergence properties of first-order algorithms for general families of deterministic and stochastic gradient compression algorithms. Our results give explicit expressions for how compression accuracy and the amount of asynchrony affect the step-sizes and guaranteed convergence times. We next turn our attention to error-compensated compression algorithms. We develop theoretical explanations for why error-compensated compression algorithms attain solutions with arbitrarily higher accuracy than direct compression algorithms. Our results provide strong convergence guarantees of error-compensated compression algorithms for distributed and federated learning problems.

    In the second part, we provide flexible tuning frameworks to optimize convergence performance of compression algorithms for a variety of system architectures. We start by analyzing data-dependent complexity that explains why direct compression algorithms are more communication-efficient than full-precision algorithms in practice. This complexity leads to automatic tuning strategies that enable popular compression algorithms on different communication networks to maximize both the convergence progress towards the solution and the communication efficiency. We then turn our attention to diminishing step-size schedules to maximize the convergence speed of the algorithms using noisy gradients. Our analysis framework is based on two classes of systems that characterize the impact of the step-sizes on the speed of noisy gradient algorithms. Our results show that such step-size schedules enable these algorithms to enjoy the optimal rate.

    Applications of the algorithms in the thesis to central machine learning problems on benchmark data validate our theoretical results.

  • 4.
    Khirirat, Sarit
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    First-Order Algorithms for Communication Efficient Distributed Learning, 2019. Licentiate thesis, monograph (Other academic)
    Abstract [en]

    Technological developments in devices and storage have made large volumes of data more accessible than ever. This transformation leads to optimization problems with massive data in both volume and dimension. In response to this trend, optimization on high-performance computing architectures has become more popular than ever. These scalable optimization solvers achieve high efficiency by splitting computational loads among multiple machines. However, they also incur large communication overhead. To solve optimization problems with millions of parameters, communication between machines has been reported to consume up to 80% of the training time. To alleviate this communication bottleneck, many optimization algorithms with data compression techniques have been studied. In practice, they have been reported to significantly reduce communication costs while exhibiting convergence almost comparable to that of the full-precision algorithms. To explain these observations, we develop theory and techniques in this thesis to design communication-efficient optimization algorithms.

    In the first part, we analyze the convergence of optimization algorithms with direct compression. First, we outline definitions of compression techniques that cover many compressors of practical interest. Then, we provide a unified analysis framework for optimization algorithms with compressors, which can be either deterministic or randomized. In particular, we show how the tuning parameters of compressed optimization algorithms must be chosen to guarantee performance. Our results show an explicit dependence on the compression accuracy and on the delay effects caused by asynchrony. This allows us to characterize the trade-off between iteration and communication complexity under gradient compression.

    In the second part, we study how error compensation schemes can improve the performance of compressed optimization algorithms. Even though convergence guarantees of optimization algorithms with error compensation have been established, there is very limited theoretical support which guarantees improved solution accuracy. We therefore develop theoretical explanations, which show that error compensation guarantees arbitrarily high solution accuracy from compressed information. In particular, error compensation helps remove accumulated compression errors, thus improving solution accuracy especially for ill-conditioned problems. We also provide strong convergence analysis of error compensation on parallel stochastic gradient descent across multiple machines. In particular, the error-compensated algorithms, unlike direct compression, result in significant reduction in the compression error.

    Applications of the algorithms in this thesis to real-world problems with benchmark data sets validate our theoretical results.

  • 5.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering (EES), Automatic Control. KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre.
    Feyzmahdavian, Hamid Reza
    Johansson, Mikael
    KTH, School of Electrical Engineering (EES), Centres, ACCESS Linnaeus Centre. KTH, School of Electrical Engineering (EES), Automatic Control.
    Mini-batch gradient descent: faster convergence under data sparsity, 2017. In: 2017 IEEE 56th Annual Conference on Decision and Control (CDC), IEEE, 2017. Conference paper (Refereed)
    Abstract [en]

    The practical performance of stochastic gradient descent on large-scale machine learning tasks is often much better than what current theoretical tools can guarantee. This indicates that there is an inherent structure in these problems that could be exploited to strengthen the analysis. In this paper, we argue that data sparsity is such a property. We derive explicit expressions for how data sparsity affects the range of admissible step-sizes and the convergence factors of mini-batch gradient descent. Our theoretical results are validated by solving least-squares support vector machine problems on both synthetic and real-life data sets. The experimental results demonstrate improved performance of our update rules compared to the traditional mini-batch gradient descent algorithm.
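    A minimal sketch of mini-batch gradient descent on a synthetic sparse least-squares problem is given below (the paper works with least-squares support vector machines; plain least squares is used here as a stand-in). The sparsity level, batch size and step-size are assumed values; the paper's contribution, the analysis of how data sparsity widens the admissible step-size range, is not reproduced here.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, batch = 2000, 100, 64
    # Sparse data: each feature is nonzero with low probability (the structure the paper exploits).
    A = rng.standard_normal((n, d)) * (rng.random((n, d)) < 0.05)
    x_star = rng.standard_normal(d)
    b = A @ x_star + 0.01 * rng.standard_normal(n)

    x = np.zeros(d)
    gamma = 2.0     # assumed step-size; sparsity is what makes a value this large admissible here
    for _ in range(3000):
        idx = rng.choice(n, size=batch, replace=False)
        Ab, bb = A[idx], b[idx]
        x -= gamma * Ab.T @ (Ab @ x - bb) / batch
    print(np.linalg.norm(x - x_star))
    ```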

  • 6.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, ACCESS Linnaeus Centre.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, ACCESS Linnaeus Centre.
    Alistarh, Dan
    IST Austria, Vienna, Austria.
    Gradient compression for communication-limited convex optimization, 2018. In: 2018 IEEE Conference on Decision and Control (CDC), Institute of Electrical and Electronics Engineers (IEEE), 2018, p. 166-171, article id 8619625. Conference paper (Refereed)
    Abstract [en]

    Data-rich applications in machine learning and control have motivated intense research on large-scale optimization. Novel algorithms have been proposed and shown to have optimal convergence rates in terms of iteration counts. However, their practical performance is severely degraded by the cost of exchanging high-dimensional gradient vectors between computing nodes. Several gradient compression heuristics have recently been proposed to reduce communications, but few theoretical results exist that quantify how they impact algorithm convergence. This paper establishes and strengthens the convergence guarantees for gradient descent under a family of gradient compression techniques. For convex optimization problems, we derive admissible step sizes and quantify both the number of iterations and the number of bits that need to be exchanged to reach a target accuracy. Finally, we validate the performance of different gradient compression techniques in simulations. The numerical results highlight the properties of different gradient compression algorithms and confirm that fast convergence with limited information exchange is possible.
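    Below is a minimal sketch of gradient descent with one simple compressor (a scaled sign compressor, one sign bit per coordinate plus one scalar), together with a running count of communicated bits. The compressor, step-size and bit accounting are illustrative assumptions, not the specific compressor family analyzed in the paper.

    ```python
    import numpy as np

    def sign_compress(g):
        """Scaled sign compressor: one sign bit per coordinate plus one scalar scale
        (an assumed example, not the full family treated in the paper)."""
        return (np.linalg.norm(g, 1) / g.size) * np.sign(g)

    d = 50
    Q = np.diag(np.linspace(1.0, 10.0, d))          # quadratic objective 0.5 x^T Q x
    x = np.ones(d)
    gamma = 0.05                                    # admissible step-size (assumed)
    bits = 0
    for _ in range(1000):
        c = sign_compress(Q @ x)                    # compress the gradient before "sending" it
        x -= gamma * c
        bits += d + 32                              # d sign bits + one 32-bit scale
    print("objective:", 0.5 * x @ Q @ x, "bits communicated:", bits)
    ```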

  • 7.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Magnusson, Sindri
    Stockholm Univ, Dept Comp & Syst Sci, Stockholm, Sweden.
    Aytekin, Arda
    Ericsson, Stockholm, Sweden.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    A Flexible Framework for Communication-Efficient Machine Learning, 2021. In: Thirty-Fifth AAAI Conference on Artificial Intelligence, Thirty-Third Conference on Innovative Applications of Artificial Intelligence and the Eleventh Symposium on Educational Advances in Artificial Intelligence, Association for the Advancement of Artificial Intelligence, 2021, Vol. 35, p. 8101-8109. Conference paper (Refereed)
    Abstract [en]

    With the increasing scale of machine learning tasks, it has become essential to reduce the communication between computing nodes. Early work on gradient compression focused on the bottleneck between CPUs and GPUs, but communication-efficiency is now needed in a variety of different system architectures, from high-performance clusters to energy-constrained IoT devices. In current practice, compression levels are typically chosen before training, and settings that work well for one task may be vastly sub-optimal for another dataset on another architecture. In this paper, we propose a flexible framework which adapts the compression level to the true gradient at each iteration, maximizing the improvement in the objective function that is achieved per communicated bit. Our framework is easy to adapt from one technology to the next by modeling how the communication cost depends on the compression level for the specific technology. Theoretical results and practical experiments indicate that the automatic tuning strategies significantly increase communication efficiency on several state-of-the-art compression schemes.
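    A hedged sketch of the per-iteration tuning idea follows: for top-K sparsification, pick the K that maximizes an estimate of objective improvement per communicated bit. The progress estimate (retained gradient energy) and the cost model (a fixed per-message overhead plus a per-entry cost) are assumptions standing in for the technology-specific models derived in the paper.

    ```python
    import numpy as np

    def best_top_k(g, overhead_bits=4000, bits_per_entry=42):
        """Choose the sparsification level K that maximizes estimated progress per bit.

        Progress of top-K is approximated by the retained gradient energy
        ||top_K(g)||^2; the cost model (fixed per-message overhead plus one value
        and one index per entry) is an assumed stand-in for a real network model.
        """
        energy = np.cumsum(np.sort(g ** 2)[::-1])        # ||top_K(g)||^2 for K = 1..d
        ks = np.arange(1, g.size + 1)
        return int(ks[np.argmax(energy / (overhead_bits + ks * bits_per_entry))])

    # Example: a gradient whose energy is concentrated in a few coordinates.
    rng = np.random.default_rng(0)
    g = 0.01 * rng.standard_normal(1000)
    g[:5] = np.array([10.0, 9.0, 8.0, 7.0, 6.0])
    print("chosen K:", best_top_k(g))                    # a small K is chosen here
    ```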

  • 8.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, ACCESS Linnaeus Centre.
    Magnusson, Sindri
    Stockholm Univ, Dept Comp & Syst Sci DSV, S-11419 Stockholm, Sweden.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). KTH, School of Electrical Engineering and Computer Science (EECS), Centres, ACCESS Linnaeus Centre.
    Compressed Gradient Methods With Hessian-Aided Error Compensation, 2021. In: IEEE Transactions on Signal Processing, ISSN 1053-587X, E-ISSN 1941-0476, Vol. 69, p. 998-1011. Article in journal (Refereed)
    Abstract [en]

    The emergence of big data has caused a dramatic shift in the operating regime for optimization algorithms. The performance bottleneck, which used to be computations, is now often communications. Several gradient compression techniques have been proposed to reduce the communication load at the price of a loss in solution accuracy. Recently, it has been shown how compression errors can be compensated for in the optimization algorithm to improve the solution accuracy. Even though convergence guarantees for error-compensated algorithms have been established, there is very limited theoretical support for quantifying the observed improvements in solution accuracy. In this paper, we show that Hessian-aided error compensation, unlike other existing schemes, avoids accumulation of compression errors on quadratic problems. We also present strong convergence guarantees of Hessian-based error compensation for stochastic gradient descent. Our numerical experiments highlight the benefits of Hessian-based error compensation, and demonstrate that similar convergence improvements are attained when only a diagonal Hessian approximation is used.
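    The sketch below contrasts direct compression, plain memory-based error compensation (the scheme studied in entry 9 below), and a Hessian-aided variant on an ill-conditioned quadratic. The Hessian-aided memory update shown, which damps the stored error by (I - gamma*H) so that the compensated iterates track exact gradient descent on a quadratic, is a reconstruction consistent with the abstract rather than the paper's verbatim formulation; the quantizer and all constants are assumptions.

    ```python
    import numpy as np

    def quantize(v, delta=0.01):
        """Deterministic rounding to a fixed grid; introduces a bounded compression error."""
        return delta * np.round(v / delta)

    d = 20
    H = np.diag(np.linspace(1.0, 100.0, d))         # ill-conditioned quadratic 0.5 x^T H x
    gamma = 1.0 / 100.0                             # step-size 1/L (assumed)

    def run(scheme, iters=5000):
        x, e = np.ones(d), np.zeros(d)
        for _ in range(iters):
            g = H @ x
            if scheme == "direct":
                q = quantize(g)                     # compress the raw gradient
            else:
                # Plain EC carries the stored error e; the Hessian-aided variant damps it
                # by (I - gamma*H), so that x - gamma*e follows exact gradient descent
                # on this quadratic (a reconstruction, see the note above).
                carried = e if scheme == "plain_ec" else (np.eye(d) - gamma * H) @ e
                c = g + carried                     # compensate before compressing
                q = quantize(c)
                e = c - q                           # memory of what the compressor dropped
            x = x - gamma * q
        return 0.5 * x @ H @ x

    for scheme in ("direct", "plain_ec", "hessian_ec"):
        print(scheme, run(scheme))                  # compare the attained objective values
    ```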

  • 9.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Magnusson, Sindri
    Harvard Univ, Sch Engn & Appl Sci, 33 Oxford St, Cambridge, MA 02138 USA.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Convergence Bounds for Compressed Gradient Methods with Memory Based Error Compensation, 2019. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, p. 2857-2861. Conference paper (Refereed)
    Abstract [en]

    The veritable scale of modern data necessitates information compression in parallel/distributed big-data optimization. Compression schemes using memory-based error compensation have displayed superior performance in practice; however, to date there are no theoretical explanations for these observed advantages. This paper provides the first theoretical support for why such compression schemes yield higher accuracy solutions in optimization. Our results cover both gradient and incremental gradient algorithms for quadratic optimization. Unlike previous works, our theoretical results explicitly quantify the accuracy gains from error compensation, especially for ill-conditioned problems. Finally, the numerical results on linear least-squares problems validate the benefit of error compensation and demonstrate tightness of our convergence guarantees.

  • 10.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Magnusson, Sindri
    Stockholm Univ, Stockholm, Sweden.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    ECO-FedSplit: Federated Learning with Error-Compensated Compression, 2022. In: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Institute of Electrical and Electronics Engineers (IEEE), 2022, p. 5952-5956. Conference paper (Refereed)
    Abstract [en]

    Federated learning is an emerging framework for collaborative machine learning on devices which do not want to share local data. State-of-the-art methods in federated learning reduce the communication frequency, but are not guaranteed to converge to the optimal model parameters. These methods also experience a communication bottleneck, especially when the devices are power-constrained and communicate over a shared medium. This paper presents ECO-FedSplit, an algorithm that increases the communication efficiency of federated learning without sacrificing solution accuracy. The key is to compress inter-device communication and to compensate for information losses in a theoretically justified manner. We prove strong convergence properties of ECO-FedSplit on strongly convex optimization problems and show that the algorithm yields a highly accurate solution with dramatically reduced communication. Extensive numerical experiments validate our theoretical result on real data sets.
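    ECO-FedSplit itself builds on an operator-splitting update that is not reproduced here. The sketch below only illustrates the mechanism highlighted in the abstract, compressing the inter-device messages and compensating for the dropped information with a per-device memory, grafted onto plain federated averaging of local model updates as a stand-in; all names and constants are assumptions.

    ```python
    import numpy as np

    def top_k(v, k):
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    rng = np.random.default_rng(0)
    n_devices, n_local, dim, k = 8, 100, 30, 3
    A = [rng.standard_normal((n_local, dim)) for _ in range(n_devices)]
    x_star = rng.standard_normal(dim)
    b = [a @ x_star + 0.01 * rng.standard_normal(n_local) for a in A]

    x = np.zeros(dim)                                   # global model kept by the server
    memory = [np.zeros(dim) for _ in range(n_devices)]  # per-device compression memories
    gamma, local_steps = 0.05, 5

    for _ in range(300):
        aggregate = np.zeros(dim)
        for i in range(n_devices):
            local = x.copy()
            for _ in range(local_steps):                # a few local gradient steps
                local -= gamma * A[i].T @ (A[i] @ local - b[i]) / n_local
            delta = local - x                           # model update to report
            msg = top_k(delta + memory[i], k)           # compressed uplink message
            memory[i] = delta + memory[i] - msg         # compensate in later rounds
            aggregate += msg
        x += aggregate / n_devices                      # server averages and broadcasts
    print(np.linalg.norm(x - x_star))
    ```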

  • 11.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Wang, Xiaoyu
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Magnusson, Sindri
    Stockholm Univ, Dept Comp & Syst Sci, S-11419 Stockholm, Sweden.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Improved Step-Size Schedules for Proximal Noisy Gradient Methods, 2023. In: IEEE Transactions on Signal Processing, ISSN 1053-587X, E-ISSN 1941-0476, Vol. 71, p. 189-201. Article in journal (Refereed)
    Abstract [en]

    Noisy gradient algorithms have emerged as one of the most popular algorithms for distributed optimization with massive data. Choosing a proper step-size schedule is an important part of tuning these algorithms for good performance. For the algorithms to attain fast convergence and high accuracy, it is intuitive to use large step-sizes in the initial iterations, when the gradient noise is typically small compared to the algorithm steps, and to reduce the step-sizes as the algorithm progresses. This intuition has been confirmed in theory and practice for stochastic gradient descent. However, similar results are lacking for other methods using approximate gradients. This paper shows that diminishing step-size strategies can indeed be applied to a broad class of noisy gradient algorithms. Our analysis framework is based on two classes of systems that characterize the impact of the step-sizes on the convergence performance of many algorithms. Our results show that such step-size schedules enable these algorithms to enjoy the optimal rate. We exemplify our results on stochastic compression algorithms. Our experiments validate fast convergence of these algorithms with the step decay schedules.
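    A minimal sketch of a proximal noisy gradient method with a step decay schedule (the step-size is halved every fixed number of iterations) is shown below on an l1-regularized least-squares problem. The schedule parameters, noise model and problem are illustrative assumptions and the sketch does not reproduce the paper's analysis.

    ```python
    import numpy as np

    def soft_threshold(v, tau):
        """Proximal operator of tau * ||.||_1."""
        return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

    rng = np.random.default_rng(0)
    n, d = 500, 100
    A = rng.standard_normal((n, d))
    x_star = np.zeros(d)
    x_star[:10] = 1.0
    b = A @ x_star + 0.1 * rng.standard_normal(n)
    lam = 0.1                                            # l1 weight (assumed)

    x = np.zeros(d)
    gamma0, decay_every, decay_factor = 0.3, 200, 0.5    # step decay schedule (assumed)
    for t in range(1600):
        gamma = gamma0 * decay_factor ** (t // decay_every)
        noisy_grad = A.T @ (A @ x - b) / n + 0.05 * rng.standard_normal(d)  # inexact gradient
        x = soft_threshold(x - gamma * noisy_grad, gamma * lam)             # proximal step
    print(np.linalg.norm(x - x_star))
    ```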

  • 12.
    Khirirat, Sarit
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Wang, Xiaoyu
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Magnússon, Sindri
    Stockholm University.
    Johansson, Mikael
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Improved Step-Size Schedules for Noisy Gradient Methods, 2021. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada: Institute of Electrical and Electronics Engineers (IEEE), 2021. Conference paper (Refereed)
    Abstract [en]

    Noise is inherent in many optimization methods, such as stochastic gradient methods, zeroth-order methods and compressed gradient methods. For such methods to converge toward a global optimum, it is intuitive to use large step-sizes in the initial iterations, when the noise is typically small compared to the algorithm steps, and to reduce the step-sizes as the algorithm progresses. This intuition has been confirmed in theory and practice for stochastic gradient methods, but similar results are lacking for other methods using approximate gradients. This paper shows that the diminishing step-size strategies can indeed be applied for a broad class of noisy gradient methods. Unlike previous works, our analysis framework shows that such step-size schedules enable these methods to enjoy an optimal O(1/k) rate. We exemplify our results on zeroth-order methods and stochastic compression methods. Our experiments validate fast convergence of these methods with the step decay schedules.
