Batched Multi-armed Bandits Problem
Authors: Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This section contains some experimental results on the performances of Ba SE policy under different grids. The default parameters are T = 5 104, K = 3, M = 3 and γ = 1, and the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment on the arithmetic grid with tj = j T/M for j [M]. Figure 1 (a)-(c) display the empirical dependence of the average Ba SE regrets under different grids, together with the comparison with the centralized UCB1 algorithm [ACBF02] without any batch constraints. We observe that the minimax grid typically results in a smallest regret among all grids, and M = 4 batches appear to be sufficient for the Ba SE performance to approach the centralized performance. We also compare our Ba SE algorithm with the ETC algorithm in [PRCS16] for the two-arm case, and Figure 1 (d) shows that Ba SE achieves lower regrets than ETC. |
| Researcher Affiliation | Academia | Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou Department of {Statistics, Electrical Engineering, Statistics, Mathematics} Stanford University {zijungao,yjhan,zren,zqzhou}@stanford.edu |
| Pseudocode | Yes | Algorithm 1: Batched Successive Elimination (Ba SE) |
| Open Source Code | Yes | The source codes of the experiment can be found in https://github.com/Mathegineer/batched-bandit. |
| Open Datasets | No | The paper describes a simulated environment with specified reward distributions and parameters, rather than using a pre-existing, publicly available dataset. It states: 'the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms'. |
| Dataset Splits | No | The paper evaluates policies in a simulated multi-armed bandit environment. It does not involve traditional dataset splits (training, validation, test) as it pertains to an online learning problem. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments. |
| Software Dependencies | No | While source code is provided via a link, the paper itself does not explicitly list software dependencies (e.g., programming languages, libraries, frameworks) with specific version numbers in the text. |
| Experiment Setup | Yes | The default parameters are T = 5 104, K = 3, M = 3 and γ = 1, and the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment on the arithmetic grid with tj = j T/M for j [M]. |