Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].

Batched Multi-armed Bandits Problem

Authors: Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou

NeurIPS 2019

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This section contains some experimental results on the performance of the BaSE policy under different grids. The default parameters are T = 5 × 10^4, K = 3, M = 3, and γ = 1, and the mean reward is µ = 0.6 for the optimal arm and µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment on the arithmetic grid with t_j = jT/M for j ∈ [M]. Figure 1 (a)-(c) display the empirical dependence of the average BaSE regrets under different grids, together with the comparison with the centralized UCB1 algorithm [ACBF02] without any batch constraints. We observe that the minimax grid typically results in the smallest regret among all grids, and M = 4 batches appear to be sufficient for the BaSE performance to approach the centralized performance. We also compare our BaSE algorithm with the ETC algorithm in [PRCS16] for the two-arm case, and Figure 1 (d) shows that BaSE achieves lower regrets than ETC.
Researcher Affiliation	Academia	Zijun Gao, Yanjun Han, Zhimei Ren, Zhengqing Zhou, Department of {Statistics, Electrical Engineering, Statistics, Mathematics}, Stanford University
Pseudocode	Yes	Algorithm 1: Batched Successive Elimination (BaSE)
Open Source Code	Yes	The source code of the experiments can be found at https://github.com/Mathegineer/batched-bandit.
Open Datasets	No	The paper describes a simulated environment with specified reward distributions and parameters, rather than using a pre-existing, publicly available dataset. It states: 'the mean reward is µ = 0.6 for the optimal arm and is µ = 0.5 for all other arms'.
Dataset Splits	No	The paper evaluates policies in a simulated multi-armed bandit environment. It does not involve traditional dataset splits (training, validation, test) as it pertains to an online learning problem.
Hardware Specification	No	The paper does not provide any specific details regarding the hardware (e.g., CPU, GPU models, memory) used to run the experiments.
Software Dependencies	No	While source code is provided via a link, the paper itself does not explicitly list software dependencies (e.g., programming languages, libraries, frameworks) with specific version numbers in the text.
Experiment Setup	Yes	The default parameters are T = 5 × 10^4, K = 3, M = 3, and γ = 1, and the mean reward is µ = 0.6 for the optimal arm and µ = 0.5 for all other arms. In addition to the minimax and geometric grids, we also experiment on the arithmetic grid with t_j = jT/M for j ∈ [M].
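The setup rows above name three batch grids, but this entry only defines the arithmetic one (t_j = jT/M). The Python sketch below builds all three for the default T = 5 × 10^4 and M = 3. The geometric recursion (t_1 = b, t_j = ⌊b·t_{j-1}⌋ with b = T^(1/M)) and the minimax recursion (t_1 = a, t_j = ⌊a·√t_{j-1}⌋ with a = T^(1/(2 − 2^(1−M)))) are assumptions with all hidden constants set to 1, not definitions taken from this entry; the function names are hypothetical.

```python
import math


def arithmetic_grid(T, M):
    # Arithmetic grid from the entry: t_j = j*T/M for j = 1..M, rounded down.
    return [(j * T) // M for j in range(1, M + 1)]


def geometric_grid(T, M):
    # ASSUMPTION: t_1 = b, t_j = floor(b * t_{j-1}), b = T**(1/M), constant set to 1.
    b = T ** (1.0 / M)
    grid, t = [], 1
    for _ in range(M - 1):
        t = math.floor(b * t)
        grid.append(min(t, T))
    grid.append(T)  # the final batch always ends at the horizon T
    return grid


def minimax_grid(T, M):
    # ASSUMPTION: t_1 = a, t_j = floor(a * sqrt(t_{j-1})),
    # a = T**(1 / (2 - 2**(1 - M))), constant set to 1.
    a = T ** (1.0 / (2.0 - 2.0 ** (1 - M)))
    grid, t = [], 1
    for _ in range(M - 1):
        t = math.floor(a * math.sqrt(t))
        grid.append(min(t, T))
    grid.append(T)  # the final batch always ends at the horizon T
    return grid


if __name__ == "__main__":
    T, M = 50_000, 3
    for name, f in [("arithmetic", arithmetic_grid),
                    ("geometric", geometric_grid),
                    ("minimax", minimax_grid)]:
        print(name, f(T, M))
```

Under these assumptions the minimax grid places its first batch end far later than the geometric grid, which spends very few pulls before its first elimination round.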
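Algorithm 1 (BaSE) is cited above but not reproduced in this entry. The following is a minimal Python sketch of batched successive elimination under stated assumptions: Bernoulli rewards; round-robin pulls of the surviving arms inside each batch, since no feedback arrives until the batch ends; elimination at each batch end of any arm whose empirical mean trails the leader by more than √(γ·log(TK)/τ), where τ is the number of pulls per surviving arm; and commitment to the empirical leader in the final batch. The elimination width and the name `base_policy` are assumptions, not the authors' code; the linked repository is the reference implementation.

```python
import math
import random


def base_policy(means, grid, gamma=1.0, seed=0):
    """Sketch of batched successive elimination on Bernoulli arms.

    means: true arm means (ASSUMPTION: Bernoulli rewards).
    grid:  batch end points t_1 < ... < t_M = T.
    Returns the pseudo-regret: sum over rounds of (best mean - pulled mean).
    """
    rng = random.Random(seed)
    K, T = len(means), grid[-1]
    best = max(means)
    active = list(range(K))
    counts, sums = [0] * K, [0.0] * K
    regret, t = 0.0, 0

    for m, end in enumerate(grid):
        last = m == len(grid) - 1
        if last and len(active) > 1 and all(counts[k] > 0 for k in active):
            # Final batch: commit to the empirically best surviving arm.
            active = [max(active, key=lambda k: sums[k] / counts[k])]
        # No feedback is available inside a batch, so this sketch
        # pulls the surviving arms round-robin until the batch end point.
        i = 0
        while t < end:
            k = active[i % len(active)]
            counts[k] += 1
            sums[k] += 1.0 if rng.random() < means[k] else 0.0
            regret += best - means[k]
            t += 1
            i += 1
        tau = min(counts[k] for k in active)
        if not last and len(active) > 1 and tau > 0:
            # ASSUMED rule: keep arms whose gap to the leader is at most
            # sqrt(gamma * log(T * K) / tau); the leader always survives.
            avg = {k: sums[k] / counts[k] for k in active}
            top = max(avg.values())
            width = math.sqrt(gamma * math.log(T * K) / tau)
            active = [k for k in active if top - avg[k] <= width]
    return regret


if __name__ == "__main__":
    T, K, M = 50_000, 3, 3
    means = [0.6] + [0.5] * (K - 1)
    grid = [(j * T) // M for j in range(1, M + 1)]  # arithmetic grid
    print(base_policy(means, grid, gamma=1.0))
```

The regret returned is the pseudo-regret (expected gap of each pulled arm), which has lower variance than realized-reward regret when averaging over seeds.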
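The centralized baseline in the comparison is UCB1 [ACBF02], which observes feedback after every pull. The sketch below uses the standard UCB1 index (empirical mean plus √(2 ln n / n_k), with n the total number of plays so far) on the same assumed Bernoulli environment; the function name `ucb1` and the Bernoulli reward model are assumptions for illustration.

```python
import math
import random


def ucb1(means, T, seed=0):
    """Centralized UCB1 on Bernoulli arms (ASSUMPTION: Bernoulli rewards).

    Returns the pseudo-regret after T rounds with feedback after every pull.
    """
    rng = random.Random(seed)
    K = len(means)
    best = max(means)
    counts, sums = [0] * K, [0.0] * K
    regret = 0.0
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1  # initialization: play each arm once
        else:
            n = t - 1  # total number of plays so far
            k = max(range(K), key=lambda a: sums[a] / counts[a]
                    + math.sqrt(2.0 * math.log(n) / counts[a]))
        counts[k] += 1
        sums[k] += 1.0 if rng.random() < means[k] else 0.0
        regret += best - means[k]
    return regret


if __name__ == "__main__":
    T, K = 50_000, 3
    print(ucb1([0.6] + [0.5] * (K - 1), T))
```

Because UCB1 can adapt after every round, it corresponds to the unconstrained limit M = T that the batched policy is compared against.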