On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems

Authors: Baekjin Kim, Ambuj Tewari

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experimental results with perturbation-based algorithms (Alg. 1-(ii),(iii)) and compare them to the UCB1 algorithm in a simulated stochastic K-armed bandit. In all experiments, the number of arms (K) is 10, the number of episodes is 1000, and the true mean rewards (µi) are generated from U[0, 1] [19]. We consider the following four examples of 1-sub-Gaussian reward distributions, each shifted by the true mean µi: (a) Uniform, U[−1, 1]; (b) Rademacher, R{−1, 1}; (c) Gaussian, N(0, 1); and (d) Gaussian mixture, W·N(−1, 1) + (1−W)·N(1, 1) where W ~ Bernoulli(1/2). Under each reward setting, we run five algorithms: UCB1, RCB with Uniform and Rademacher perturbations, and FTPL with Gaussian N(0, σ²) and Double-exponential(σ) perturbations, after using grid search to tune the confidence levels of the confidence-based algorithms and the parameter σ of the FTPL algorithms.
Researcher Affiliation | Academia | Baekjin Kim, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, baekjin@umich.edu; Ambuj Tewari, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, tewaria@umich.edu
Pseudocode | Yes | Algorithm 1: Randomized probability matching algorithms via Perturbation
Open Source Code | Yes | https://github.com/Kimbaekjin/Perturbation-Methods-StochasticMAB
Open Datasets | No | The paper mentions that true mean rewards are generated from U[0, 1] and uses specific reward distributions (Uniform, Rademacher, Gaussian, Gaussian mixture), all of which are simulated/synthetic. It does not provide access information for a publicly available or open dataset.
Dataset Splits | No | The paper states "In all experiments, the number of arms (K) is 10, the number of different episodes is 1000" and discusses reward distributions, but it does not specify any training/validation/test splits (percentages or sample counts), since it uses simulated data rather than a pre-existing dataset with standard splits.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments; it only mentions running experiments on a simulated stochastic K-armed bandit.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. It refers to algorithms and distributions but lists no software dependencies.
Experiment Setup | Yes | In all experiments, the number of arms (K) is 10, the number of episodes is 1000, and the true mean rewards (µi) are generated from U[0, 1] [19]... Grid search is used to tune the confidence levels of the confidence-based algorithms and the parameter σ of the FTPL algorithms; all tuned confidence levels and parameters are specified in Figure 1.
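
To make the simulated setup above concrete, the following Python sketch reproduces the shape of the experiment: K = 10 arms with means drawn from U[0, 1], Gaussian rewards (example (c)), and a comparison of UCB1 against an FTPL-style index that adds a Gaussian N(0, σ²) perturbation scaled by 1/√n_i to each empirical mean. This is an illustrative reconstruction under assumptions, not the authors' released code (see the repository above for that); the exact perturbation scaling and the names ucb1 and ftpl_gaussian are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)

    K, T, sigma = 10, 10_000, 1.0          # arms, horizon, FTPL perturbation scale (assumed)
    mu = rng.uniform(0.0, 1.0, size=K)     # true means drawn from U[0, 1]

    def pull(i):
        # 1-sub-Gaussian reward shifted by the true mean (Gaussian example (c))
        return mu[i] + rng.normal(0.0, 1.0)

    def run(index_fn):
        counts = np.zeros(K)
        sums = np.zeros(K)
        regret = 0.0
        for t in range(1, T + 1):
            if t <= K:                     # play each arm once to initialize
                a = t - 1
            else:
                a = int(np.argmax(index_fn(sums / counts, counts, t)))
            r = pull(a)
            counts[a] += 1
            sums[a] += r
            regret += mu.max() - mu[a]     # pseudo-regret against the best arm
        return regret

    def ucb1(means, counts, t):
        # UCB1 index: empirical mean + sqrt(2 log t / n_i)
        return means + np.sqrt(2.0 * np.log(t) / counts)

    def ftpl_gaussian(means, counts, t):
        # FTPL-style index: empirical mean + Gaussian perturbation scaled by sigma / sqrt(n_i)
        # (an assumed scaling for illustration; see Algorithm 1 in the paper for the exact form)
        return means + sigma * rng.normal(size=K) / np.sqrt(counts)

    print("UCB1 regret:          ", run(ucb1))
    print("FTPL(Gaussian) regret:", run(ftpl_gaussian))

The RCB variants described in the paper fit the same loop with a different index function: roughly, a bounded random draw (Uniform or Rademacher) multiplying a UCB-style confidence width rather than an additive unbounded perturbation; consult Algorithm 1-(ii) for the precise definition.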