One Practical Algorithm for Both Stochastic and Adversarial Bandits

Authors: Yevgeny Seldin, Aleksandrs Slivkins

ICML 2014

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Our results for the stochastic regime are supported by experimental validation." |
| Researcher Affiliation | Collaboration | Yevgeny Seldin (YEVGENY.SELDIN@GMAIL.COM), Queensland University of Technology, Brisbane, Australia; Aleksandrs Slivkins (SLIVKINS@MICROSOFT.COM), Microsoft Research, New York, NY, USA |
| Pseudocode | Yes | Algorithm 1: EXP3++. (A hedged code sketch follows the table.) |
| Open Source Code | No | The paper describes the algorithm but does not provide a link to, or an explicit statement about, the availability of its source code. |
| Open Datasets | No | The paper describes a synthetic data generation process ("stochastic multiarmed bandit problem with Bernoulli rewards... rewards are Bernoulli with bias 0.5 and for the single best arm the reward is Bernoulli with bias $0.5 + \Delta$") rather than using a pre-existing publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper gives the parameters of its simulation-based experiments (values of K, number of rounds, repetitions) but does not mention explicit train/validation/test splits as typically found in machine learning experiments with static datasets. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU/CPU models, memory, or specific computing environments) used for running the experiments. |
| Software Dependencies | No | The paper compares against other algorithms (EXP3, UCB1, Thompson sampling) but does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | We run the experiments with K = 2, K = 10, and K = 100, and $\Delta = 0.1$ and $\Delta = 0.01$ (in total, six combinations of K and $\Delta$). We run each game for $10^7$ rounds and make ten repetitions of each experiment. In the experiments EXP3++ is parametrized by $\xi_t(a) = \ln\big(t\,\hat{\Delta}_t(a)^2\big) \big/ \big(32\, t\, \hat{\Delta}_t(a)^2\big)$, where $\hat{\Delta}_t(a)$ is the empirical estimate of the gap $\Delta(a)$ defined in (2). In order to demonstrate that in the stochastic regime the exploration parameters are in full control of the performance, we run EXP3++ with two different learning rates: EXP3++EMP corresponds to $\eta_t = \beta_t$ and EXP3++ACC corresponds to $\eta_t = 1$. (A simulation driver matching this setup follows the table.) |
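
Since the pseudocode row references Algorithm 1, here is a minimal Python sketch of EXP3++, not the authors' code: it follows the paper's mixing rule $\epsilon_t(a) = \min\{1/(2K),\ \beta_t,\ \xi_t(a)\}$ with $\beta_t = \tfrac{1}{2}\sqrt{\ln K / (tK)}$, the empirical gap estimate $\hat{\Delta}_t(a) = \min\{1, \tfrac{1}{t}(\hat{L}_t(a) - \min_{a'} \hat{L}_t(a'))\}$ of eq. (2), and the experimental tuning of $\xi_t(a)$ quoted in the table. The clipping of $\xi_t(a)$ at zero and all variable names are our assumptions.

```python
import numpy as np

def exp3pp(mu, T, rng, accelerated=False):
    """Sketch of EXP3++ on a Bernoulli bandit with mean rewards `mu`.

    accelerated=False -> eta_t = beta_t (the EXP3++EMP run in the paper);
    accelerated=True  -> eta_t = 1     (the EXP3++ACC run).
    """
    K = len(mu)
    L_hat = np.zeros(K)              # cumulative importance-weighted loss estimates
    total_loss = 0.0
    for t in range(1, T + 1):
        beta = 0.5 * np.sqrt(np.log(K) / (t * K))
        eta = 1.0 if accelerated else beta
        # Empirical gap estimates, eq. (2) of the paper:
        gap = np.minimum(1.0, (L_hat - L_hat.min()) / t)
        # Experimental tuning of xi_t(a) quoted in the table; clipping at
        # zero for near-zero gaps is our assumption, not the paper's text.
        with np.errstate(divide="ignore", invalid="ignore"):
            xi = np.log(t * gap ** 2) / (32.0 * t * gap ** 2)
        xi = np.maximum(np.nan_to_num(xi, nan=0.0, posinf=0.0, neginf=0.0), 0.0)
        eps = np.minimum(np.minimum(1.0 / (2.0 * K), beta), xi)  # per-arm exploration
        rho = np.exp(-eta * (L_hat - L_hat.min()))               # exponential weights
        rho /= rho.sum()
        p = (1.0 - eps.sum()) * rho + eps                        # sampling distribution
        a = rng.choice(K, p=p)
        loss = float(rng.random() >= mu[a])                      # Bernoulli reward -> 0/1 loss
        L_hat[a] += loss / p[a]                                  # importance weighting
        total_loss += loss
    return total_loss
```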
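
A driver matching the reported setup could then look as follows. The horizon and seed are scaled-down placeholders (the paper runs $10^7$ rounds with ten repetitions), so the printed numbers only illustrate the mechanics:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000                                # the paper uses 10**7 rounds
for K in (2, 10, 100):
    for delta in (0.1, 0.01):
        mu = np.full(K, 0.5)
        mu[0] += delta                    # single best arm with bias 0.5 + Delta
        loss = exp3pp(mu, T, rng)
        regret = loss - T * (1.0 - mu[0])  # realized regret vs. the best arm
        print(f"K={K:3d}  Delta={delta:4.2f}  regret~{regret:8.1f}")
```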