Batched Multi-armed Bandits Problem

Authors: Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou

NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental This section contains some experimental results on the performances of Ba SE policy under different grids. The default parameters are T = 5 104, K = 3, M = 3 and γ = 1, and the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment on the arithmetic grid with tj = j T/M for j [M]. Figure 1 (a)-(c) display the empirical dependence of the average Ba SE regrets under different grids, together with the comparison with the centralized UCB1 algorithm [ACBF02] without any batch constraints. We observe that the minimax grid typically results in a smallest regret among all grids, and M = 4 batches appear to be sufficient for the Ba SE performance to approach the centralized performance. We also compare our Ba SE algorithm with the ETC algorithm in [PRCS16] for the two-arm case, and Figure 1 (d) shows that Ba SE achieves lower regrets than ETC.
Researcher Affiliation Academia Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou Department of {Statistics, Electrical Engineering, Statistics, Mathematics} Stanford University {zijungao,yjhan,zren,zqzhou}@stanford.edu
Pseudocode Yes Algorithm 1: Batched Successive Elimination (Ba SE)
Open Source Code Yes The source codes of the experiment can be found in https://github.com/Mathegineer/batched-bandit.
Open Datasets No The paper describes a simulated environment with specified reward distributions and parameters, rather than using a pre-existing, publicly available dataset. It states: 'the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms'.
Dataset Splits No The paper evaluates policies in a simulated multi-armed bandit environment. It does not involve traditional dataset splits (training, validation, test) as it pertains to an online learning problem.
Hardware Specification No The paper does not provide any specific details regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies No While source code is provided via a link, the paper itself does not explicitly list software dependencies (e.g., programming languages, libraries, frameworks) with specific version numbers in the text.
Experiment Setup Yes The default parameters are T = 5 104, K = 3, M = 3 and γ = 1, and the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment on the arithmetic grid with tj = j T/M for j [M].