kth.sePublications KTH
Change search
Link to record
Permanent link

Direct link
Proutiere, AlexandreORCID iD iconorcid.org/0000-0002-4679-4673
Alternative names
Publications (10 of 128) Show all publications
Morais, D., Habbab, A., Fatima, S. B. & Proutiere, A. (2025). Reinforcement Learning with World Models for Autonomous Excavation Optimization in Wheel Loaders. In: : . Paper presented at 66th International Conference of Scandinavian Simulation Society, SIMS 2025, Stavanger, Norway, Sep 22 2025 - Sep 24 2025 (pp. 72-77). Elsevier BV, 59
Open this publication in new window or tab >>Reinforcement Learning with World Models for Autonomous Excavation Optimization in Wheel Loaders
2025 (English)Conference paper, Published paper (Refereed)
Abstract [en]

Automating the bucket-filling task in wheel loaders is challenging due to the complex, nonlinear interaction between the bucket and granular material. This work presents a model-based reinforcement learning approach to optimize the bucket-filling strategy for Zeux, Volvo's autonomous electric wheel loader concept. A Long Short-Term Memory (LSTM) surrogate model is trained on data from Volvo's high-fidelity simulator to emulate realistic dynamics, enabling efficient policy training using Proximal Policy Optimization (PPO) with imagined rollouts. This reduces computational cost and eliminates the need for direct interaction with the high-fidelity simulator. Compared to Volvo's current rule-based driver model, the learned policy achieves 89% improvement in productivity and 56% increase in energy efficiency. Our results show that world models can accelerate reinforcement learning for heavy machinery control, enabling the discovery of strategies that outperform controllers based on human expert behavior.

Place, publisher, year, edition, pages
Elsevier BV, 2025
Keywords
Autonomous Systems, Bucket-Filling, Deep Learning, Heavy Machinery Simulation, Reinforcement Learning, Wheel loaders, World Models
National Category
Control Engineering Computer Sciences Robotics and automation Computer Systems
Identifiers
urn:nbn:se:kth:diva-375953 (URN)10.1016/j.ifacol.2025.12.184 (DOI)2-s2.0-105026936365 (Scopus ID)
Conference
66th International Conference of Scandinavian Simulation Society, SIMS 2025, Stavanger, Norway, Sep 22 2025 - Sep 24 2025
Note

QC 20260129

Available from: 2026-01-29 Created: 2026-01-29 Last updated: 2026-01-29Bibliographically approved
Hultin, H., Hult, H., Proutiere, A., Samama, S. & Tarighati, A. (2024). A deterministic policy gradient method for order execution and option hedging in the presence of market impact. Journal of Financial Data Science, 6(3), 81-114
Open this publication in new window or tab >>A deterministic policy gradient method for order execution and option hedging in the presence of market impact
Show others...
2024 (English)In: Journal of Financial Data Science, ISSN 2640-3943, Vol. 6, no 3, p. 81-114Article in journal (Refereed) Published
Abstract [en]

In this article, an iterative deterministic policy gradient method for finding optimal strategies in the presence of market impact is introduced. The derivation of the policy gradient sheds light on a proper way of handling the market impact of trades in the context of reinforcement learning. Similar to many machine learning methods, the proposed deterministic policy gradient method is based on mini-batch stochastic gradient descent optimization. The method is demonstrated to consistently perform well for several different objectives and market dynamics when applied to the financial applications of order execution and option hedging.

Place, publisher, year, edition, pages
With Intelligence LLC, 2024
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-353470 (URN)10.3905/jfds.2024.1.164 (DOI)2-s2.0-85202532970 (Scopus ID)
Note

QC 20240924

Available from: 2024-09-19 Created: 2024-09-19 Last updated: 2024-11-20Bibliographically approved
Lindståhl, S., Proutiere, A. & Johnsson, A. (2024). Change Point Detection with Adaptive Measurement Schedules for Network Performance Verification. Performance Evaluation Review, 52(1), 83-84
Open this publication in new window or tab >>Change Point Detection with Adaptive Measurement Schedules for Network Performance Verification
2024 (English)In: Performance Evaluation Review, ISSN 0163-5999, E-ISSN 1557-9484, Vol. 52, no 1, p. 83-84Article in journal (Refereed) Published
Abstract [en]

When verifying that a communications network fulfills its specified performance, it is critical to note sudden shifts in network behavior as quickly as possible. Change point detection methods can be useful in this endeavor, but classical methods rely on measuring with a fixed measurement period, which can often be suboptimal in terms of measurement costs. In this paper, we extend the existing framework of change point detection with a notion of physical time. Instead of merely deciding when to stop, agents must now also decide at which future time to take the next measurement. Agents must now minimize the necessary number of measurements pre-and post-change, while maintaining a trade-off between post-change delay and false alarm rate. We establish, through this framework, the suboptimality of typical periodic measurements and propose a simple alternative, called crisis mode agents. We show analytically that crisis mode agents significantly outperform periodic measurements schemes. We further verify this in numerical evaluation, both on an array of synthetic change point detection problems as well as on the problem of detecting traffic load changes in a 5G test bed through end-to-end RTT measurements.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
applied statistics, change detection, hypothesis testing, network management, network measurements
National Category
Control Engineering Computer Sciences
Identifiers
urn:nbn:se:kth:diva-348739 (URN)10.1145/3673660.3655049 (DOI)2-s2.0-85196413157 (Scopus ID)
Note

QC 20250922

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-09-22Bibliographically approved
Lindståhl, S., Proutiere, A. & Johnsson, A. (2024). Change point detection with adaptive measurement schedules for network performance verification. In: SIGMETRICS/PERFORMANCE 2024 - Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems: . Paper presented at 2024 ACM SIGMETRICS/IFIP Performance Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2024, Venice, Italy, June 10-14, 2024 (pp. 83-84). Association for Computing Machinery (ACM)
Open this publication in new window or tab >>Change point detection with adaptive measurement schedules for network performance verification
2024 (English)In: SIGMETRICS/PERFORMANCE 2024 - Abstracts of the 2024 ACM SIGMETRICS/IFIP PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems, Association for Computing Machinery (ACM) , 2024, p. 83-84Conference paper, Published paper (Refereed)
Abstract [en]

When verifying that a communications network fulfills its specified performance, it is critical to note sudden shifts in network behavior as quickly as possible. Change point detection methods can be useful in this endeavor, but classical methods rely on measuring with a fixed measurement period, which can often be suboptimal in terms of measurement costs. In this paper, we extend the existing framework of change point detection with a notion of physical time. Instead of merely deciding when to stop, agents must now also decide at which future time to take the next measurement. Agents must now minimize the necessary number of measurements pre- and post-change, while maintaining a trade-off between post-change delay and false alarm rate. We establish, through this framework, the suboptimality of typical periodic measurements and propose a simple alternative, called crisis mode agents. We show analytically that crisis mode agents significantly outperform periodic measurements schemes. We further verify this in numerical evaluation, both on an array of synthetic change point detection problems as well as on the problem of detecting traffic load changes in a 5G test bed through end-to-end RTT measurements.

Place, publisher, year, edition, pages
Association for Computing Machinery (ACM), 2024
Keywords
applied statistics, change detection, hypothesis testing, network management, network measurements
National Category
Control Engineering Computer Sciences
Identifiers
urn:nbn:se:kth:diva-348770 (URN)10.1145/3652963.3655049 (DOI)2-s2.0-85196380214 (Scopus ID)
Conference
2024 ACM SIGMETRICS/IFIP Performance Conference on Measurement and Modeling of Computer Systems, SIGMETRICS/PERFORMANCE 2024, Venice, Italy, June 10-14, 2024
Note

Part of ISBN 9798400706240

QC 20250922

Available from: 2024-06-27 Created: 2024-06-27 Last updated: 2025-09-22Bibliographically approved
Zheng, F. & Proutiere, A. (2024). Conformal Predictions under Markovian Data. In: International Conference on Machine Learning, ICML 2024: . Paper presented at 41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, Jul 21 2024 - Jul 27 2024 (pp. 61470-61497). ML Research Press
Open this publication in new window or tab >>Conformal Predictions under Markovian Data
2024 (English)In: International Conference on Machine Learning, ICML 2024, ML Research Press , 2024, p. 61470-61497Conference paper, Published paper (Refereed)
Abstract [en]

We study the split Conformal Prediction method when applied to Markovian data.We quantify the gap in terms of coverage induced by the correlations in the data (compared to exchangeable data).This gap strongly depends on the mixing properties of the underlying Markov chain, and we prove that it typically scales as (Equation presented) (where tmix is the mixing time of the chain).We also derive upper bounds on the impact of the correlations on the size of the prediction set.Finally we present K-split CP, a method that consists in thinning the calibration dataset and that adapts to the mixing properties of the chain.Its coverage gap is reduced to tmix/(nln(n)) without really affecting the size of the prediction set.We finally test our algorithms on synthetic and real-world datasets.

Place, publisher, year, edition, pages
ML Research Press, 2024
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:kth:diva-353955 (URN)2-s2.0-85203831312 (Scopus ID)
Conference
41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, Jul 21 2024 - Jul 27 2024
Note

QC 20240925

Available from: 2024-09-25 Created: 2024-09-25 Last updated: 2024-09-25Bibliographically approved
Yuan, D., Wang, L., Proutiere, A. & Shi, G. (2024). Distributed zeroth-order optimization: Convergence rates that match centralized counterpart. Automatica, 159, Article ID 111328.
Open this publication in new window or tab >>Distributed zeroth-order optimization: Convergence rates that match centralized counterpart
2024 (English)In: Automatica, ISSN 0005-1098, E-ISSN 1873-2836, Vol. 159, article id 111328Article in journal (Refereed) Published
Abstract [en]

Zeroth-order optimization has become increasingly important in complex optimization and machine learning when cost functions are impossible to be described in closed analytical forms. The key idea of zeroth-order optimization lies in the ability for a learner to build gradient estimates by queries sent to the cost function, and then traditional gradient descent algorithms can be executed replacing gradients by the estimates. For optimization over large-scale multi-agent systems with decentralized data and costs, zeroth-order optimization can continue to be utilized to develop scalable and distributed algorithms. In this paper, we aim at understanding the trend in performance transitioning from centralized to distributed zeroth-order algorithms in terms of convergence rates, and focus on multi-agent systems with time-varying communication networks. We establish a series of convergence rates for distributed zeroth-order subgradient algorithms under both one-point and two-point zeroth-order oracles. Apart from the additional node-to-node communication cost due to the distributed nature of algorithms, the established rates in convergence are shown to match their centralized counterpart. We also propose a multi-stage distributed zeroth-order algorithm that better utilizes the learning rates, reduces the computational complexity, and attains even faster convergence rates for compact decision set.

Place, publisher, year, edition, pages
Elsevier BV, 2024
Keywords
Distributed optimization, Multi-stage optimization algorithm, Optimal convergence rate, Zeroth-order optimization
National Category
Computational Mathematics
Identifiers
urn:nbn:se:kth:diva-339481 (URN)10.1016/j.automatica.2023.111328 (DOI)001101835000001 ()2-s2.0-85175044513 (Scopus ID)
Note

QC 20231113

Available from: 2023-11-13 Created: 2023-11-13 Last updated: 2025-12-05Bibliographically approved
Vannella, F., Proutiere, A., Jedra, Y. & Jeong, J. (2024). Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandits Approach. IEEE Transactions on Mobile Computing, 23(12), 12666-12679
Open this publication in new window or tab >>Learning Optimal Antenna Tilt Control Policies: A Contextual Linear Bandits Approach
2024 (English)In: IEEE Transactions on Mobile Computing, ISSN 1536-1233, E-ISSN 1558-0660, Vol. 23, no 12, p. 12666-12679Article in journal (Refereed) Published
Abstract [en]

Controlling antenna tilts in cellular networks is critical to achieve a good trade-off between network coverage and capacity. We devise algorithms learning optimal tilt control policies from existing data (passive learning setting) or from data actively generated by the algorithms (active learning setting). We formalize the design of such algorithms as a Best Policy Identification problem in Contextual Linear Bandits (CLB). In CLB, an action represents an antenna tilt update; the context captures current network conditions; the reward corresponds to an improvement of performance, mixing coverage and capacity. The objective is to identify an approximately optimal policy (a function mapping the context to an action with maximal reward). For both active and passive learning, we derive information-theoretical lower bounds on the number of samples required by any algorithm returning an approximately optimal policy with a given level of certainty, and devise algorithms achieving these fundamental limits. We apply our algorithms to the Remote Electrical Tilt optimization problem in cellular networks, and show that they can produce optimal tilt update policy using much fewer data samples than naive or existing rule-based learning algorithms. This paper is an extension of work presented at IEEE International Conference on Computer Communications (INFOCOM) 2022 (Vannella et al. 2022).

Place, publisher, year, edition, pages
Institute of Electrical and Electronics Engineers (IEEE), 2024
Keywords
Antennas, Optimization, Approximation algorithms, Interference, Mobile computing, Control systems, Context modeling, Antenna tilt optimization, coverage capacity optimization, best policy identification, contextual linear bandits
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-357812 (URN)10.1109/TMC.2024.3424192 (DOI)001359244600289 ()2-s2.0-85197552454 (Scopus ID)
Note

QC 20241217

Available from: 2024-12-17 Created: 2024-12-17 Last updated: 2025-01-17Bibliographically approved
Jedra, Y., Réveillard, W., Stojanovic, S. & Proutiere, A. (2024). Low-rank bandits via tight two-to-infinity singular subspace recovery. In: International Conference on Machine Learning, ICML 2024: . Paper presented at 41st International Conference on Machine Learning, ICML 2024, July 21-27, 2024, Vienna, Austria (pp. 21430-21485). ML Research Press
Open this publication in new window or tab >>Low-rank bandits via tight two-to-infinity singular subspace recovery
2024 (English)In: International Conference on Machine Learning, ICML 2024, ML Research Press , 2024, p. 21430-21485Conference paper, Published paper (Refereed)
Abstract [en]

We study contextual bandits with low-rank structure where, in each round, if the (context, arm) pair (i, j) ∈ [m] × [n] is selected, the learner observes a noisy sample of the (i, j)-th entry of an unknown low-rank reward matrix. Successive contexts are generated randomly in an i.i.d. manner and are revealed to the learner. For such bandits, we present efficient algorithms for policy evaluation, best policy identification and regret minimization. For policy evaluation and best policy identification, we show that our algorithms are nearly minimax optimal. For instance, the number of samples required to return an ε-optimal policy with probability at least 1 - δ typically scales as m + n/ε2 log(1/δ). Our regret minimization algorithm enjoys minimax guarantees typically scaling as r5/4(m + n)3/4 √T, which improves over existing algorithms. All the proposed algorithms consist of two phases: they first leverage spectral methods to estimate the left and right singular subspaces of the low-rank reward matrix. We show that these estimates enjoy tight error guarantees in the two-to-infinity norm. This in turn allows us to reformulate our problems as a misspecified linear bandit problem with dimension roughly r(m + n) and misspecification controlled by the subspace recovery error, as well as to design the second phase of our algorithms efficiently.

Place, publisher, year, edition, pages
ML Research Press, 2024
National Category
Control Engineering
Identifiers
urn:nbn:se:kth:diva-353951 (URN)2-s2.0-85203793840 (Scopus ID)
Conference
41st International Conference on Machine Learning, ICML 2024, July 21-27, 2024, Vienna, Austria
Note

QC 20240926

Available from: 2024-09-25 Created: 2024-09-25 Last updated: 2024-09-26Bibliographically approved
Stojanovic, S., Jedra, Y. & Proutiere, A. (2024). Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation. In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024: . Paper presented at 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, Dec 9 2024 - Dec 15 2024. Neural Information Processing Systems Foundation
Open this publication in new window or tab >>Model-free Low-Rank Reinforcement Learning via Leveraged Entry-wise Matrix Estimation
2024 (English)In: Advances in Neural Information Processing Systems 37 - 38th Conference on Neural Information Processing Systems, NeurIPS 2024, Neural Information Processing Systems Foundation , 2024Conference paper, Published paper (Refereed)
Abstract [en]

We consider the problem of learning an ε-optimal policy in controlled dynamical systems with low-rank latent structure. For this problem, we present LoRa-PI (Low-Rank Policy Iteration), a model-free learning algorithm alternating between policy improvement and policy evaluation steps. In the latter, the algorithm estimates the low-rank matrix corresponding to the (state, action) value function of the current policy using the following two-phase procedure. The entries of the matrix are first sampled uniformly at random to estimate, via a spectral method, the leverage scores of its rows and columns. These scores are then used to extract a few important rows and columns whose entries are further sampled. The algorithm exploits these new samples to complete the matrix estimation using a CUR-like method. For this leveraged matrix estimation procedure, we establish entry-wise guarantees that remarkably, do not depend on the coherence of the matrix but only on its spikiness. These guarantees imply that LoRa-PI learns an ε-optimal policy using Õ(S+A/poly(1 - γ)ε2) samples where S (resp. A) denotes the number of states (resp. actions) and γ the discount factor. Our algorithm achieves this order-optimal (in S, A and ε) sample complexity under milder conditions than those assumed in previously proposed approaches.

Place, publisher, year, edition, pages
Neural Information Processing Systems Foundation, 2024
National Category
Control Engineering Signal Processing
Identifiers
urn:nbn:se:kth:diva-361953 (URN)2-s2.0-105000534247 (Scopus ID)
Conference
38th Conference on Neural Information Processing Systems, NeurIPS 2024, Vancouver, Canada, Dec 9 2024 - Dec 15 2024
Note

QC 20250404

Available from: 2025-04-03 Created: 2025-04-03 Last updated: 2025-09-22Bibliographically approved
Wang, P.-A., Ariu, K. & Proutiere, A. (2024). On Universally Optimal Algorithms for A/B Testing. In: International Conference on Machine Learning, ICML 2024: . Paper presented at 41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, Jul 21 2024 - Jul 27 2024 (pp. 50065-50091). ML Research Press
Open this publication in new window or tab >>On Universally Optimal Algorithms for A/B Testing
2024 (English)In: International Conference on Machine Learning, ICML 2024, ML Research Press , 2024, p. 50065-50091Conference paper, Published paper (Refereed)
Abstract [en]

We study the problem of best-arm identification with fixed budget in stochastic multi-armed bandits with Bernoulli rewards. For the problem with two arms, also known as the A/B testing problem, we prove that there is no algorithm that (i) performs as well as the algorithm sampling each arm equally (referred to as the uniform sampling algorithm) in all instances, and that (ii) strictly outperforms uniform sampling on at least one instance. In short, there is no algorithm better than the uniform sampling algorithm. To establish this result, we first introduce the natural class of consistent and stable algorithms, and show that any algorithm that performs as well as the uniform sampling algorithm in all instances belongs to this class. The proof then proceeds by deriving a lower bound on the error rate satisfied by any consistent and stable algorithm, and by showing that the uniform sampling algorithm matches this lower bound. Our results provide a solution to the two open problems presented in (Qin, 2022). For the general problem with more than two arms, we provide a first set of results. We characterize the asymptotic error rate of the celebrated Successive Rejects (SR) algorithm (Audibert et al., 2010) and show that, surprisingly, the uniform sampling algorithm outperforms the SR algorithm in some instances.

Place, publisher, year, edition, pages
ML Research Press, 2024
National Category
Computer Sciences
Identifiers
urn:nbn:se:kth:diva-353954 (URN)2-s2.0-85203839125 (Scopus ID)
Conference
41st International Conference on Machine Learning, ICML 2024, Vienna, Austria, Jul 21 2024 - Jul 27 2024
Note

QC 20240925

Available from: 2024-09-25 Created: 2024-09-25 Last updated: 2024-09-25Bibliographically approved
Organisations
Identifiers
ORCID iD: ORCID iD iconorcid.org/0000-0002-4679-4673

Search in DiVA

Show all publications