Statistical Inference on Multi-armed Bandits with Delayed Feedback

Authors: Lei Shi, Jingshen Wang, Tianhao Wu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We illustrate the finite-sample performance of our approach through Monte Carlo experiments."
Researcher Affiliation | Academia | "1 Division of Biostatistics, University of California, Berkeley, USA; 2 Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA."
Pseudocode | No | The paper describes the ϵ-greedy algorithm in prose in Section 4.2 but does not provide a formal pseudocode block or algorithm listing.
Open Source Code | Yes | The source code is available from the GitHub repository: https://github.com/LeiShi-rocks/DelayBandits.
Open Datasets | No | The paper states: "We run ϵ-greedy on binary bandits and evaluate the impact of margins on the performance of various estimators in Table 1. The potential outcomes are generated from a normal distribution with variance 1." This indicates simulated data, not a publicly available dataset with concrete access information.
Dataset Splits | No | The paper describes simulation setups with generated data and does not mention explicit training, validation, or test splits for any specific dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for the experiments (e.g., CPU, GPU models, memory, or cloud instances).
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, programming languages, or tools used in the experiments.
Experiment Setup | Yes | In ϵ-greedy algorithms, the algorithm picks the arm a_{t,max} that achieves the highest sample average; in the following stage it pulls a_{t,max} with probability greater than 1 − ϵ (exploitation) and each of the remaining (d − 1) arms with probability ϵ/(d − 1) (exploration). In particular, the paper studies the policy evaluation problem under a power-decaying ϵ_t, i.e., ϵ_t = t^(−α) for some α ≥ 0. Mathematically, Ȳ_{a,t} denotes the averaged observed outcome collected on arm a before time t. "We run ϵ-greedy on binary bandits and evaluate the impact of margins on the performance of various estimators in Table 1. The potential outcomes are generated from a normal distribution with variance 1." For arm a = 1, there is a positive censoring probability: P_1{D = ∞} = 0.5. Arm 2 has no never-observed outcomes, i.e., its censoring probability is P_2{D = ∞} = 0. Two settings with different margin sizes are compared: (i) zero margin, µ(1) − µ(2) = 0; (ii) non-zero margin, µ(1) − µ(2) = 0.1. A simulation sketch of this setup follows the table.