On the Optimality of Perturbations in Stochastic and Adversarial Multi-armed Bandit Problems
Authors: Baekjin Kim, Ambuj Tewari
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present some experimental results with perturbation-based algorithms (Alg. 1-(ii),(iii)) and compare them to the UCB1 algorithm in the simulated stochastic K-armed bandit. In all experiments, the number of arms (K) is 10, the number of different episodes is 1000, and true mean rewards (µ_i) are generated from U[0, 1] [19]. We consider the following four examples of 1-sub-Gaussian reward distributions that will be shifted by true mean µ_i: (a) Uniform, U[-1, 1], (b) Rademacher, R{-1, 1}, (c) Gaussian, N(0, 1), and (d) Gaussian mixture, W·N(-1, 1) + (1-W)·N(1, 1) where W ~ Bernoulli(1/2). Under each reward setting, we run five different algorithms: UCB1, RCB with Uniform and Rademacher, and FTPL via Gaussian N(0, σ²) and Double-exponential(σ), after we use grid search to tune confidence levels for confidence-based algorithms and the parameter σ for FTPL algorithms. |
| Researcher Affiliation | Academia | Baekjin Kim, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, baekjin@umich.edu; Ambuj Tewari, Department of Statistics, University of Michigan, Ann Arbor, MI 48109, tewaria@umich.edu |
| Pseudocode | Yes | Algorithm 1: Randomized probability matching algorithms via Perturbation |
| Open Source Code | Yes | https://github.com/Kimbaekjin/Perturbation-Methods-StochasticMAB |
| Open Datasets | No | The paper mentions that true mean rewards are generated from U[0, 1] and uses specific reward distributions (Uniform, Rademacher, Gaussian, Gaussian Mixture), which are simulated/synthetic. It does not provide access information for a publicly available or open dataset. |
| Dataset Splits | No | The paper states "In all experiments, the number of arms (K) is 10, the number of different episodes is 1000" and discusses reward distributions. However, it does not specify any training/test/validation dataset splits (percentages or sample counts) as it uses simulated data, not a pre-existing dataset with typical splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. It only mentions running "simulated stochastic K-armed bandit" experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. It refers to algorithms and distributions but no software dependencies. |
| Experiment Setup | Yes | In all experiments, the number of arms (K) is 10, the number of different episodes is 1000, and true mean rewards (µ_i) are generated from U[0, 1] [19]... after we use grid search to tune confidence levels for confidence-based algorithms and the parameter σ for FTPL algorithms. All tuned confidence levels and parameters are specified in Figure 1. |
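
To make the simulated setup quoted above concrete, here is a minimal sketch (not the authors' released code; see the repository linked above) of a stochastic K-armed bandit experiment comparing UCB1 with a Gaussian-perturbation FTPL rule. The function name `run_bandit`, the 1000-round horizon per episode, the episode count, and the exact form of the FTPL index (empirical mean plus σ-scaled Gaussian noise shrunk by √n_i) are illustrative assumptions rather than a verbatim reproduction of the paper's Algorithm 1.

```python
import numpy as np

def run_bandit(policy, K=10, T=1000, sigma=1.0, rng=None):
    """Simulate one episode of a stochastic K-armed bandit with
    Gaussian N(mu_i, 1) rewards and return the cumulative regret path."""
    rng = rng or np.random.default_rng()
    mu = rng.uniform(0.0, 1.0, size=K)            # true means drawn from U[0, 1]
    counts = np.zeros(K)                          # pulls per arm
    sums = np.zeros(K)                            # reward sums per arm
    regret = np.zeros(T)
    for t in range(T):
        if t < K:
            arm = t                               # pull each arm once to initialize
        else:
            means = sums / counts
            if policy == "ucb1":
                arm = np.argmax(means + np.sqrt(2 * np.log(t + 1) / counts))
            else:  # "ftpl-gaussian": perturb empirical means with fresh Gaussian noise
                arm = np.argmax(means + sigma * rng.standard_normal(K) / np.sqrt(counts))
        reward = mu[arm] + rng.standard_normal()  # 1-sub-Gaussian (unit Gaussian) reward noise
        counts[arm] += 1
        sums[arm] += reward
        regret[t] = mu.max() - mu[arm]
    return np.cumsum(regret)

# Average cumulative regret over independent episodes, in the spirit of the paper's setup.
episodes = 100
for policy in ("ucb1", "ftpl-gaussian"):
    paths = np.stack([run_bandit(policy) for _ in range(episodes)])
    print(policy, "final mean regret:", paths[:, -1].mean().round(2))
```

Swapping the `rng.standard_normal(K)` draw for Rademacher, uniform, or double-exponential noise would give rough analogues of the other perturbation variants the paper evaluates.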