kth.se Publications
1 - 21 of 21
  • 1. Abe, Kenshi
    et al.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Sakamoto, Mitsuki
    Iwasaki, Atsushi
A Slingshot Approach to Learning in Monotone Games. Manuscript (preprint) (Other academic)
    Abstract [en]

In this paper, we address the problem of computing equilibria in monotone games. The traditional Follow the Regularized Leader algorithms fail to converge to an equilibrium even in two-player zero-sum games. Although optimistic versions of these algorithms have been proposed with last-iterate convergence guarantees, they require noiseless gradient feedback. To overcome this limitation, we present a novel framework that achieves last-iterate convergence even in the presence of noise. Our key idea involves perturbing or regularizing the payoffs or utilities of the games. This perturbation serves to pull the current strategy to an anchored strategy, which we refer to as a slingshot strategy. First, we establish the convergence rates of our framework to a stationary point near an equilibrium, regardless of the presence or absence of noise. Next, we introduce an approach to periodically update the slingshot strategy with the current strategy. We interpret this approach as a proximal point method and demonstrate its last-iterate convergence. Our framework is comprehensive, incorporating existing payoff-regularized algorithms and enabling the development of new algorithms with last-iterate convergence properties. Finally, we show that our algorithms, based on this framework, empirically exhibit faster convergence.
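    The anchoring idea lends itself to a compact simulation. The sketch below is a minimal illustration (not the authors' code) of a slingshot-style update on matching pennies: a multiplicative-weights step on a payoff perturbed toward an anchor strategy, with the anchor periodically reset to the current strategy. The game, step size eta, perturbation weight mu, update period, and noise level are all illustrative assumptions.

    ```python
    # Minimal sketch of the "slingshot" idea from the abstract, under assumed
    # constants: regularized-leader-style updates on payoffs perturbed toward
    # an anchor strategy, with the anchor periodically reset (proximal-point
    # style) to the current strategy.
    import numpy as np

    rng = np.random.default_rng(0)
    A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies (zero-sum)

    def mwu_step(x, grad, eta):
        """Multiplicative-weights update, renormalized onto the simplex."""
        y = x * np.exp(eta * grad)
        return y / y.sum()

    x = np.ones(2) / 2           # row player's mixed strategy
    y = np.ones(2) / 2           # column player's mixed strategy
    sx, sy = x.copy(), y.copy()  # slingshot (anchor) strategies
    eta, mu, period, noise = 0.05, 0.1, 200, 0.1

    for t in range(1, 5001):
        gx = A @ y + noise * rng.standard_normal(2)    # noisy gradient feedback
        gy = -A.T @ x + noise * rng.standard_normal(2)
        # The perturbation pulls the current strategy toward the anchor.
        x = mwu_step(x, gx + mu * (sx - x), eta)
        y = mwu_step(y, gy + mu * (sy - y), eta)
        if t % period == 0:                            # periodic anchor update
            sx, sy = x.copy(), y.copy()

    print(x, y)  # should approach the (0.5, 0.5) equilibrium despite the noise
    ```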

  • 2.
    Abe, Kenshi
    et al.
CyberAgent, Inc.
    Ariu, Kaito
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). CyberAgent, Inc.
    Sakamoto, Mitsuki
    Toyoshima, Kentaro
    University of Electro-Communications.
    Iwasaki, Atsushi
Last-Iterate Convergence with Full and Noisy Feedback in Two-Player Zero-Sum Games. 2023. In: Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, ML Research Press, 2023, Vol. 206, p. 7999-8028. Conference paper (Refereed)
    Abstract [en]

    This paper proposes Mutation-Driven Multiplicative Weights Update (M2WU) for learning an equilibrium in two-player zero-sum normal-form games and proves that it exhibits the last-iterate convergence property in both full and noisy feedback settings. In the former, players observe their exact gradient vectors of the utility functions. In the latter, they only observe the noisy gradient vectors. Even the celebrated Multiplicative Weights Update (MWU) and Optimistic MWU (OMWU) algorithms may not converge to a Nash equilibrium with noisy feedback. On the contrary, M2WU exhibits the last-iterate convergence to a stationary point near a Nash equilibrium in both feedback settings. We then prove that it converges to an exact Nash equilibrium by iteratively adapting the mutation term. We empirically confirm that M2WU outperforms MWU and OMWU in exploitability and convergence rates.
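    To make the failure mode concrete, here is a toy comparison under assumed constants, with a simplified mutation term that nudges strategies toward the uniform strategy; this is only one reading of M2WU's mutation idea, not the paper's exact update. Plain MWU's last iterate drifts away from the equilibrium of matching pennies, while the mutated update contracts toward it.

    ```python
    # Sketch of why last-iterate convergence is the hard part: plain MWU drifts
    # away from the equilibrium of matching pennies, while a small mutation term
    # (one reading of M2WU's idea; all constants are illustrative) damps the cycle.
    import numpy as np

    A = np.array([[1.0, -1.0], [-1.0, 1.0]])
    eta, mu, T = 0.05, 0.05, 20000

    def run(mutation):
        x = np.array([0.9, 0.1]); y = np.array([0.9, 0.1])
        for _ in range(T):
            gx = A @ y + mutation * (0.5 - x)    # mutation pulls toward uniform
            gy = -A.T @ x + mutation * (0.5 - y)
            x = x * np.exp(eta * gx); x /= x.sum()
            y = y * np.exp(eta * gy); y /= y.sum()
        return abs(x[0] - 0.5) + abs(y[0] - 0.5)  # last-iterate distance from equilibrium

    print("plain MWU:", run(0.0))     # stays far from (0.5, 0.5): last iterate cycles
    print("with mutation:", run(mu))  # contracts toward the equilibrium
    ```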

  • 3.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
Inference and Online Learning in Structured Stochastic Systems. 2023. Doctoral thesis, comprehensive summary (Other academic)
    Abstract [en]

    This thesis contributes to the field of stochastic online learning problems, with a collection of six papers each addressing unique aspects of online learning and inference problems under specific structures. The first four papers focus on exploration and inference problems, uncovering fundamental information-theoretic limits and efficient algorithms under various structures. The last two papers focus on maximizing rewards by efficiently leveraging these structures.

The first paper addresses the complex problem of learning to cluster items based on binary user feedback for multiple questions. It establishes information-theoretical error lower bounds for both uniform and adaptive selection strategies under a fixed budget of rounds or users, and proposes an adaptive algorithm that efficiently allocates the budget. The second paper tackles the challenge of uncovering hidden communities in the Labeled Stochastic Block Model using single-shot observations of labels. It introduces a computationally efficient algorithm, Instance-Adaptive Clustering, which is the first to match instance-specific lower bounds on the expected number of misclassified items. The third paper delves into the best-arm identification or simple regret minimization problem within a Bayesian setting. It takes into consideration a prior distribution for the bandit problem and the expectation of simple regret with respect to that distribution, defining it as Bayesian simple regret. It characterizes the rate of Bayesian simple regret assuming certain continuity conditions on the prior, revealing that the leading term of Bayesian simple regret stems from parameters where the gap between optimal and suboptimal actions is less than √(log(T)/T). The fourth paper contributes to the fixed budget best-arm identification problem for two-arm bandits with Bernoulli rewards. It demonstrates the optimality of uniform sampling, which evenly samples the arms. It proves that no algorithm can outperform uniform sampling on some bandit instances while performing at least as well on all others. The fifth paper revisits the regret minimization problem in sparse stochastic contextual linear bandits. It introduces a new algorithm, the Thresholded Lasso Bandit, which estimates the linear reward function and its sparse support, and then selects an arm based on these estimations. The algorithm achieves superior regret upper bounds compared to previous algorithms and numerically outperforms them. The sixth and final paper provides a theoretical analysis of recommendation systems in an online setting under unknown user-item preference probabilities and various structural assumptions. It derives regret lower bounds for these structures and designs optimal algorithms that achieve these bounds. The analysis reveals the relative weights of the different components of regret, providing valuable insights into efficient algorithms for online recommendation systems.

    This thesis addresses the technical challenge of structured stochastic online learning problems, providing new insights into the power and limitations of adaptivity in these problems.

  • 4.
    Ariu, Kaito
    et al.
KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). CyberAgent Inc., Tokyo, Japan.
    Abe, Kenshi
CyberAgent Inc., Tokyo, Japan.
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
Thresholded Lasso Bandit. 2022. In: International Conference On Machine Learning, Vol. 162 / [ed] Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., Sabato, S., ML Research Press, 2022, p. 878-928. Conference paper (Refereed)
    Abstract [en]

In this paper, we revisit the regret minimization problem in sparse stochastic contextual linear bandits, where feature vectors may be of large dimension d, but where the reward function depends on a few, say s0 << d, of these features only. We present Thresholded Lasso bandit, an algorithm that (i) estimates the vector defining the reward function as well as its sparse support, i.e., significant feature elements, using the Lasso framework with thresholding, and (ii) selects an arm greedily according to this estimate projected on its support. The algorithm does not require prior knowledge of the sparsity index s0 and can be parameter-free under some symmetric assumptions. For this simple algorithm, we establish non-asymptotic regret upper bounds scaling as O(log d + √T) in general, and as O(log d + log T) under the so-called margin condition (a probabilistic condition on the separation of the arm rewards). The regret of previous algorithms scales as O(log d + √(T log(dT))) and O(log T log d) in the two settings, respectively. Through numerical experiments, we confirm that our algorithm outperforms existing methods.
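    As a concrete illustration of steps (i) and (ii), the following sketch runs the estimate-threshold-select loop on a synthetic sparse instance using scikit-learn's Lasso. The threshold value, regularization schedule, and the small forced-exploration warmup are assumptions for illustration, not the paper's tuned choices.

    ```python
    # Sketch of the two steps in the abstract: (i) Lasso-estimate the reward
    # vector and threshold it to get a sparse support estimate, (ii) play the
    # arm that is greedily best for the estimate restricted to that support.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    d, s0, T = 100, 3, 500
    theta = np.zeros(d); theta[:s0] = 1.0          # sparse true reward vector

    X, r = [], []                                  # history of played contexts and rewards
    for t in range(1, T + 1):
        arms = rng.standard_normal((20, d))        # 20 candidate context vectors this round
        if t <= 10:                                # tiny warmup (an assumption)
            a = arms[rng.integers(20)]
        else:
            lam = 0.05 * np.sqrt(np.log(d) / t)    # illustrative regularization schedule
            lasso = Lasso(alpha=lam).fit(np.array(X), np.array(r))
            support = np.abs(lasso.coef_) >= 0.05  # thresholding step
            theta_hat = np.where(support, lasso.coef_, 0.0)
            a = arms[np.argmax(arms @ theta_hat)]  # greedy arm selection
        X.append(a); r.append(a @ theta + rng.standard_normal())

    print("estimated support:", np.flatnonzero(np.abs(theta_hat) > 0))
    ```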

  • 5.
    Ariu, Kaito
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Kato, Masahiro
    Komiyama, Junpei
    McAlinn, Kenichiro
    Qin, Chao
Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling. Manuscript (preprint) (Other academic)
  • 6.
    Ariu, Kaito
    et al.
KTH, School of Electrical Engineering and Computer Science (EECS). AI Lab, CyberAgent, Inc.
    Kato, Masahiro
AI Lab, CyberAgent, Inc.
    Komiyama, Junpei
    Stern School of Business, New York University.
    McAlinn, Kenichiro
    Fox School of Business, Temple University.
    Qin, Chao
    Columbia Business School, Columbia University.
Policy Choice and Best Arm Identification: Asymptotic Analysis of Exploration Sampling. Manuscript (preprint) (Other academic)
  • 7.
    Ariu, Kaito
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Ok, Jungseul
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Yun, Se-Young
Optimal Clustering from Noisy Binary Feedback. Manuscript (preprint) (Other academic)
    Abstract [en]

We consider the problem of solving large-scale labeling tasks with minimal effort put on the users. Examples of such tasks include those in some of the recent CAPTCHA systems, where users' clicks (binary answers) constitute the only data available to label images. Specifically, we study the generic problem of clustering a set of items from binary user feedback. Items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the hardness of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm, whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.

  • 8.
    Ariu, Kaito
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). CyberAgent, Tokyo, Japan.
    Ok, Jungseul
POSTECH, Pohang-si, South Korea.
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Yun, Seyoung
Korea Advanced Institute of Science and Technology, Daejeon, South Korea.
Optimal clustering from noisy binary feedback. 2024. In: Machine Learning, ISSN 0885-6125, E-ISSN 1573-0565, Vol. 113, no. 5, p. 2733-2764. Article in journal (Refereed)
    Abstract [en]

We study the problem of clustering a set of items from binary user feedback. Such a problem arises in crowdsourcing platforms solving large-scale labeling tasks with minimal effort put on the users. For example, in some of the recent reCAPTCHA systems, users' clicks (binary answers) can be used to efficiently label images. In our inference problem, items are grouped into initially unknown non-overlapping clusters. To recover these clusters, the learner sequentially presents to users a finite list of items together with a question with a binary answer selected from a fixed finite set. For each of these items, the user provides a noisy answer whose expectation is determined by the item cluster and the question and by an item-specific parameter characterizing the hardness of classifying the item. The objective is to devise an algorithm with a minimal cluster recovery error rate. We derive problem-specific information-theoretical lower bounds on the error rate satisfied by any algorithm, for both uniform and adaptive (list, question) selection strategies. For uniform selection, we present a simple algorithm built upon the K-means algorithm, whose performance almost matches the fundamental limits. For adaptive selection, we develop an adaptive algorithm that is inspired by the derivation of the information-theoretical error lower bounds, and in turn allocates the budget in an efficient way. The algorithm learns to select items hard to cluster and relevant questions more often. We compare the performance of our algorithms with or without the adaptive selection strategy numerically and illustrate the gain achieved by being adaptive.
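    For the uniform-selection regime, the "simple algorithm built upon the K-means algorithm" can be caricatured in a few lines: aggregate each item's binary answers into empirical answer rates per question, then cluster those rate vectors. The cluster count, answer probabilities, and budget below are assumptions for illustration.

    ```python
    # Toy sketch of the uniform-selection baseline: per-item empirical "yes"
    # rates as features, clustered with K-means. All parameters are assumed.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n_items, n_questions, answers_per_pair, K = 200, 5, 30, 2
    true_cluster = rng.integers(K, size=n_items)
    p = rng.uniform(0.2, 0.8, size=(K, n_questions))  # P(answer = 1 | cluster, question)

    # Each item's feature vector: empirical answer rate for each question.
    rates = np.empty((n_items, n_questions))
    for i in range(n_items):
        for q in range(n_questions):
            rates[i, q] = rng.binomial(answers_per_pair, p[true_cluster[i], q]) / answers_per_pair

    pred = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(rates)
    # Error rate up to label permutation (K = 2 here, so check both labelings).
    err = min(np.mean(pred != true_cluster), np.mean((1 - pred) != true_cluster))
    print("misclassification rate:", err)
    ```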

  • 9.
    Ariu, Kaito
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Yun, Se-Young
Instance-Optimal Cluster Recovery in the Labeled Stochastic Block Model. Manuscript (preprint) (Other academic)
  • 10.
    Ariu, Kaito
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Ryu, Narae
    Yun, Se-Young
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
Regret in Online Recommendation Systems. 2020. Conference paper (Refereed)
    Abstract [en]

This paper proposes a theoretical analysis of recommendation systems in an online setting, where items are sequentially recommended to users over time. In each round, a user, randomly picked from a population of m users, requests a recommendation. The decision-maker observes the user and selects an item from a catalogue of n items. Importantly, an item cannot be recommended twice to the same user. The probabilities that a user likes each item are unknown. The performance of the recommendation algorithm is captured through its regret, considering as a reference an Oracle algorithm aware of these probabilities. We investigate various structural assumptions on these probabilities: we derive for each structure regret lower bounds, and devise algorithms achieving these limits. Interestingly, our analysis reveals the relative weights of the different components of regret: the component due to the constraint of not presenting the same item twice to the same user, that due to learning the probabilities that users like items, and finally that arising when learning the underlying structure.
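    A stripped-down version of the setting shows where the no-repeat constraint enters. This sketch assumes item popularity is shared across users and uses a vanilla UCB index with made-up sizes; the paper's algorithms and structural assumptions are richer.

    ```python
    # Bare-bones online recommendation loop: users arrive at random, the
    # learner keeps a UCB index per item, and an item already shown to a user
    # may never be recommended to that user again. All constants are assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    m, n, T = 50, 50, 1000
    p = rng.uniform(size=n)              # unknown P(user likes item), shared by all users here
    shown = [set() for _ in range(m)]    # items already recommended to each user
    clicks = np.zeros(n); plays = np.zeros(n)

    for t in range(1, T + 1):
        u = int(rng.integers(m))         # a random user requests a recommendation
        if len(shown[u]) == n:           # this user has already seen every item
            continue
        with np.errstate(divide="ignore", invalid="ignore"):
            ucb = clicks / plays + np.sqrt(2 * np.log(t) / plays)
        ucb[plays == 0] = np.inf         # optimism for never-played items
        ucb[list(shown[u])] = -np.inf    # no-repeat constraint for this user
        a = int(np.argmax(ucb))
        shown[u].add(a)
        plays[a] += 1; clicks[a] += rng.random() < p[a]

    print("top items by plays:", np.argsort(-plays)[:5], "truly best:", np.argsort(-p)[:5])
    ```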

  • 11.
    Fujimoto, Yuma
    et al.
    Research Center for Integrative Evolutionary Science, SOKENDAI, Japan ; Universal Biology Institute (UBI), The University of Tokyo, Japan ; AI Lab, CyberAgent, Inc., Japan.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). AI Lab, CyberAgent, Inc., Japan.
    Abe, Kenshi
    AI Lab, CyberAgent, Inc., Japan.
Learning in Multi-Memory Games Triggers Complex Dynamics Diverging from Nash Equilibrium. 2023. In: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, p. 118-125. Conference paper (Refereed)
    Abstract [en]

Repeated games consider a situation where multiple agents are motivated by their independent rewards throughout learning. In general, the dynamics of their learning become complex. Especially when their rewards compete with each other, as in zero-sum games, the dynamics often do not converge to their optimum, i.e., the Nash equilibrium. To tackle such complexity, many studies have understood various learning algorithms as dynamical systems and discovered qualitative insights among the algorithms. However, such studies have yet to handle multi-memory games (where agents can memorize actions they played in the past and choose their actions based on their memories), even though memorization plays a pivotal role in artificial intelligence and interpersonal relationships. This study extends two major learning algorithms in games, i.e., replicator dynamics and gradient ascent, into multi-memory games. Then, we prove their dynamics are identical. Furthermore, theoretically and experimentally, we clarify that the learning dynamics diverge from the Nash equilibrium in multi-memory zero-sum games and reach heteroclinic cycles (sojourning longer around the boundary of the strategy space), providing a fundamental advance in learning in games.
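    The non-convergence phenomenon this work builds on is easiest to see in the memoryless special case, which the sketch below reproduces with a standard Euler discretization of replicator dynamics on matching pennies; the paper's multi-memory extension is not implemented here.

    ```python
    # Textbook replicator dynamics in a zero-sum game: the strategies orbit
    # around the Nash equilibrium instead of converging to it (and the Euler
    # discretization slowly spirals outward). Game and step size are assumed.
    import numpy as np

    A = np.array([[1.0, -1.0], [-1.0, 1.0]])  # matching pennies
    x, y, dt = np.array([0.7, 0.3]), np.array([0.6, 0.4]), 1e-3

    for _ in range(200_000):
        fx, fy = A @ y, -A.T @ x             # payoff vectors for each player
        x = x + dt * x * (fx - x @ fx)       # replicator update (Euler step)
        y = y + dt * y * (fy - y @ fy)

    print(x, y)  # still circling (0.5, 0.5) after 200k steps, not converging
    ```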

  • 12. Fujimoto, Yuma
    et al.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Abe, Kenshi
Memory Asymmetry: A Key to Convergence in Zero-Sum Games. Manuscript (preprint) (Other academic)
  • 13.
    Fujimoto, Yuma
    et al.
SOKENDAI; The University of Tokyo; CyberAgent.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control). CyberAgent.
    Abe, Kenshi
    CyberAgent; The University of Electro-Communications.
Memory Asymmetry Creates Heteroclinic Orbits to Nash Equilibrium in Learning in Zero-Sum Games. 2024. Conference paper (Refereed)
    Abstract [en]

Learning in games considers how multiple agents maximize their own rewards through repeated games. Memory, the ability of an agent to change its action depending on the history of actions in previous games, is often introduced into learning to explore more clever strategies and to discuss the decision-making of real agents like humans. However, such games with memory are hard to analyze because they exhibit complex phenomena like chaotic dynamics or divergence from Nash equilibrium. In particular, how asymmetry in memory capacities between agents affects learning in games is still unclear. In response, this study formulates a gradient ascent algorithm in games with asymmetric memory capacities. To obtain theoretical insights into learning dynamics, we first consider a simple case of zero-sum games. We observe complex behavior, where the learning dynamics draw a heteroclinic connection from unstable fixed points to stable ones. Despite this complexity, we analyze the learning dynamics and prove local convergence to these stable fixed points, i.e., the Nash equilibria. We identify the mechanism driving this convergence: an agent with a longer memory learns to exploit the other, which in turn endows the other's utility function with strict concavity. We further numerically observe such convergence for various initial strategies, action numbers, and memory lengths. This study reveals a novel phenomenon due to memory asymmetry, providing fundamental strides in learning in games and new insights into computing equilibria.

  • 14. Kato, Masahiro
    et al.
    Abe, Kenshi
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Yasui, Shota
A practical guide of off-policy evaluation for bandit problems. Manuscript (preprint) (Other academic)
    Abstract [en]

Off-policy evaluation (OPE) is the problem of estimating the value of a target policy from samples obtained via different policies. Recently, applying OPE methods to bandit problems has garnered attention. For the theoretical guarantees of an estimator of the policy value, the OPE methods require various conditions on the target policy and the policy used for generating the samples. However, existing studies did not carefully discuss the practical situations where such conditions hold, and the gap between them remains. This paper aims to show new results for bridging the gap. Based on the properties of the evaluation policy, we categorize OPE situations. Then, among practical applications, we mainly discuss the best policy selection. For that situation, we propose a meta-algorithm based on existing OPE estimators. We investigate the proposed concepts using synthetic and open real-world datasets in experiments.
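    For readers new to OPE, the building block behind such estimators is easy to state. Below is a textbook inverse-propensity-weighting (IPW) sketch on a synthetic logged-bandit dataset; the setup and policies are assumptions for illustration, and the paper's meta-algorithm chooses among estimators of this kind rather than using this exact one.

    ```python
    # Textbook IPW estimator of a target policy's value from data logged under
    # a different (known-propensity) behavior policy. All numbers are assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, n = 3, 10_000
    q = np.array([0.1, 0.5, 0.9])          # true expected reward per action

    behavior = np.array([0.5, 0.3, 0.2])   # logging policy (known propensities)
    target = np.array([0.1, 0.1, 0.8])     # policy whose value we want

    a = rng.choice(n_actions, size=n, p=behavior)  # logged actions
    r = rng.binomial(1, q[a])                      # logged rewards

    # Reweight each logged reward by target(a) / behavior(a); this is unbiased
    # for the target policy's value whenever behavior covers target's support.
    ipw_value = np.mean(target[a] / behavior[a] * r)
    print("IPW estimate:", ipw_value, " true value:", target @ q)
    ```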

  • 15. Kato, Masahiro
    et al.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
The Role of Contextual Information in Best Arm Identification. Manuscript (preprint) (Other academic)
  • 16. Kato, Masahiro
    et al.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Imaizumi, Masaaki
    Nomura, Masahiro
    Qin, Chao
Optimal Best Arm Identification in Two-Armed Bandits with a Fixed Budget under a Small Gap. Manuscript (preprint) (Other academic)
  • 17. Komiyama, Junpei
    et al.
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Kato, Masahiro
    Qin, Chao
Rate-optimal Bayesian Simple Regret in Best Arm Identification. Manuscript (preprint) (Other academic)
    Abstract [en]

We consider best arm identification in the multi-armed bandit problem. Assuming certain continuity conditions of the prior, we characterize the rate of the Bayesian simple regret. Differing from Bayesian regret minimization (Lai, 1987), the leading term in the Bayesian simple regret derives from the region where the gap between optimal and suboptimal arms is smaller than √(log(T)/T). We propose a simple and easy-to-compute algorithm with its leading term matching the lower bound up to a constant factor; simulation results support our theoretical findings.
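    Bayesian simple regret can be estimated directly by Monte Carlo, which the sketch below does for a two-armed Bernoulli bandit under a uniform prior, with uniform sampling and an empirical-best recommendation. All of these choices are illustrative assumptions, not the paper's algorithm.

    ```python
    # Monte Carlo estimate of Bayesian simple regret: draw an instance from the
    # prior, sample uniformly for T rounds, recommend the empirical best arm,
    # and average the resulting gap over many prior draws.
    import numpy as np

    rng = np.random.default_rng(0)
    T, trials = 1000, 20_000
    regrets = np.empty(trials)
    for i in range(trials):
        mu = rng.uniform(size=2)                   # bandit instance drawn from the prior
        counts = np.array([T // 2, T - T // 2])    # uniform sampling across both arms
        means = rng.binomial(counts, mu) / counts  # empirical means after T pulls
        rec = int(np.argmax(means))                # recommend the empirical best arm
        regrets[i] = mu.max() - mu[rec]            # simple regret on this instance

    # Per the abstract, the dominant contribution comes from instances whose
    # gap is below sqrt(log(T)/T).
    print("estimated Bayesian simple regret:", regrets.mean())
    ```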

  • 18. Shiino, Hiroaki
    et al.
    Ariu, Kaito
    Abe, Kenshi
    Togashi, Riku
Exploration of Unranked Items in Safe Online Learning to Re-Rank. 2023. In: SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery (ACM), 2023, p. 1991-1995. Conference paper (Refereed)
    Abstract [en]

Bandit algorithms for online learning to rank (OLTR) problems often aim to maximize long-term revenue by utilizing user feedback. From a practical point of view, however, such algorithms have a high risk of hurting user experience due to their aggressive exploration. Thus, there has been a rising demand for safe exploration in recent years. One approach to safe exploration is to gradually enhance the quality of an original ranking that is already guaranteed to be of acceptable quality. In this paper, we propose a safe OLTR algorithm that efficiently exchanges one of the items in the current ranking with an item outside the ranking (i.e., an unranked item) to perform exploration. We select an unranked item optimistically to explore based on Kullback-Leibler upper confidence bounds (KL-UCB) and safely re-rank the items including the selected one. Through experiments, we demonstrate that the proposed algorithm improves long-term regret over baselines without any safety violation.
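    The KL-UCB index used for selecting which unranked item to explore has a standard computation: find the largest mean q consistent with the observed click rate under a log t confidence budget, via bisection. Below is a self-contained sketch of the generic Bernoulli KL-UCB index, not the paper's full re-ranking algorithm.

    ```python
    # Bernoulli KL-UCB index computed by bisection: the largest q >= mean with
    # pulls * KL(mean, q) <= log(t) (+ optional c * log log t correction).
    import math

    def kl_bernoulli(p, q):
        """KL divergence between Bernoulli(p) and Bernoulli(q)."""
        eps = 1e-12
        p = min(max(p, eps), 1 - eps); q = min(max(q, eps), 1 - eps)
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    def kl_ucb_index(mean, pulls, t, c=0.0):
        budget = math.log(t) + c * math.log(max(math.log(t), 1e-12))
        lo, hi = mean, 1.0
        for _ in range(50):                  # bisection on the index value
            mid = (lo + hi) / 2
            if pulls * kl_bernoulli(mean, mid) <= budget:
                lo = mid
            else:
                hi = mid
        return lo

    # e.g. an item clicked 3 times in 10 displays, evaluated at round t = 100:
    print(kl_ucb_index(3 / 10, 10, 100))
    ```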

  • 19.
    Tzeng, Ruo-Chun
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Computer Science, Theoretical Computer Science, TCS.
    Ohsaka, Naoto
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
Matroid Semi-Bandits in Sublinear Time. 2024. In: Proceedings of the 41st International Conference on Machine Learning, 2024. Conference paper (Refereed)
    Abstract [en]

We study the matroid semi-bandits problem, where at each round the learner plays a subset of K arms from a feasible set, and the goal is to maximize the expected cumulative linear rewards. Existing algorithms have per-round time complexity at least Ω(K), which becomes expensive when K is large. To address this computational issue, we propose FasterCUCB, whose sampling rule takes time sublinear in K for common classes of matroids: O(D polylog(K) polylog(T)) for uniform matroids, partition matroids, and graphical matroids, and O(D√K polylog(T)) for transversal matroids. Here, D is the maximum number of elements in any feasible subset of arms, and T is the horizon. Our technique is based on dynamic maintenance of an approximate maximum-weight basis over inner-product weights. Although the introduction of an approximate maximum-weight basis presents a challenge in regret analysis, we can still guarantee an upper bound on regret as tight as CUCB in the sense that it matches the gap-dependent lower bound by Kveton et al. (2014a) asymptotically.
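    For intuition about what the sampling rule must maintain, here is plain CUCB on the simplest case, the uniform matroid, where the maximum-weight basis is just the top-K arms by UCB index. This toy version pays the per-round sorting cost that FasterCUCB's dynamic data structures are designed to avoid; all constants are assumptions.

    ```python
    # CUCB on a uniform matroid: each round, play the K arms with the largest
    # UCB indices (the max-weight basis), then update every played arm from
    # its own observed reward (semi-bandit feedback).
    import numpy as np

    rng = np.random.default_rng(0)
    n, K, T = 50, 5, 3000
    mu = rng.uniform(size=n)                     # unknown mean rewards
    pulls = np.ones(n)                           # one initialization pull per arm
    sums = rng.binomial(1, mu).astype(float)

    for t in range(n + 1, T + 1):
        ucb = sums / pulls + np.sqrt(1.5 * np.log(t) / pulls)
        basis = np.argpartition(-ucb, K)[:K]     # max-weight basis of the uniform matroid
        rewards = rng.binomial(1, mu[basis])     # one observed reward per played arm
        sums[basis] += rewards; pulls[basis] += 1

    print("played most:", np.sort(np.argsort(-pulls)[:K]),
          "truly best:", np.sort(np.argsort(-mu)[:K]))
    ```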

  • 20.
Wang, Po-An
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Jedra, Yassir
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Russo, Alessio
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
Optimal Algorithms for Multiplayer Multi-Armed Bandits. 2020. In: Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, PMLR / [ed] Chiappa, S., Calandra, R., ML Research Press, 2020. Conference paper (Refereed)
    Abstract [en]

The paper addresses various Multiplayer Multi-Armed Bandit (MMAB) problems, where M decision-makers, or players, collaborate to maximize their cumulative reward. We first investigate the MMAB problem where players selecting the same arms experience a collision (and are aware of it) and do not collect any reward. For this problem, we present DPE1 (Decentralized Parsimonious Exploration), a decentralized algorithm that achieves the same asymptotic regret as that obtained by an optimal centralized algorithm. DPE1 is simpler than the state-of-the-art algorithm SIC-MMAB (Boursier and Perchet, 2019), and yet offers better performance guarantees. We then study the MMAB problem without collision, where players may select the same arm. Players sit on the vertices of a graph, and in each round, they are able to send a message to their neighbours in the graph. We present DPE2, a simple and asymptotically optimal algorithm that outperforms the state-of-the-art algorithm DD-UCB (Martinez-Rubio et al., 2019). Besides, under DPE2, the expected number of bits transmitted by the players in the graph is finite.
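    The collision model in the first setting is simple enough to simulate directly. This sketch uses fixed illustrative strategies only (DPE1 itself is not implemented) to show why coordination matters: an orthogonal assignment of players to the best arms collects full reward, while uncoordinated random play loses reward to collisions.

    ```python
    # Collision model: every player picks an arm each round; players that
    # collide on the same arm receive zero reward. All numbers are assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    M, n_arms, T = 3, 5, 10_000
    mu = np.array([0.9, 0.8, 0.7, 0.4, 0.2])  # arm means

    def play_round(choices):
        """Return each player's reward; colliding players get 0."""
        rewards = np.zeros(M)
        for player, a in enumerate(choices):
            if np.sum(choices == a) == 1:      # no collision on arm a
                rewards[player] = rng.binomial(1, mu[a])
        return rewards

    # Orthogonal assignment on the M best arms (what a good algorithm
    # converges to) versus uncoordinated uniform-random play.
    orthogonal = np.array([0, 1, 2])
    total_orth = sum(play_round(orthogonal).sum() for _ in range(T))
    total_rand = sum(play_round(rng.integers(n_arms, size=M)).sum() for _ in range(T))
    print("orthogonal on best arms:", total_orth, " uniform random:", total_rand)
    ```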

  • 21.
    Wang, Po-An
    et al.
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Ariu, Kaito
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
    Proutiere, Alexandre
    KTH, School of Electrical Engineering and Computer Science (EECS), Intelligent systems, Decision and Control Systems (Automatic Control).
On Uniformly Optimal Algorithms for Best Arm Identification in Two-Armed Bandits with Fixed Budget. Manuscript (preprint) (Other academic)