Leveraging (Biased) Information: Multi-armed Bandits with Offline Data

Authors: Wang Chi Cheung, Lixing Lyu

ICML 2024

Reproducibility assessment. Each variable below is listed with its result, followed by the supporting LLM response.
Research Type: Experimental
We perform numerical experiments to evaluate MIN-UCB. We compare MIN-UCB with three related benchmarks. The first policy is vanilla UCB (Auer et al., 2002), which ignores the offline data; we call it PURE-UCB. The second policy chooses the arm with the largest UCB^S_t(a), defined in (8); we call it UCB^S. The third policy is HUCB1 (Shivaswamy & Joachims, 2012), which we call HUCB. In all experiments, we simulate our algorithm MIN-UCB and the benchmarks on several instances, with fixed K = 4. Each arm's offline and online rewards follow unit-variance Gaussian distributions with means µ_off(a) and µ_on(a), respectively. In each instance, we run each algorithm 100 times and report the average and standard error.

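For concreteness, here is a minimal Python sketch of this protocol for the PURE-UCB baseline: unit-variance Gaussian rewards, 100 independent runs, and the mean cumulative regret with its standard error. The horizon T = 10,000 and the UCB1 confidence radius are assumptions of this sketch; the section does not state the horizon used.

```python
import numpy as np

class PureUCB:
    """Vanilla UCB1 (Auer et al., 2002), ignoring the offline data."""

    def __init__(self, K):
        self.n = np.zeros(K)  # pull counts per arm
        self.s = np.zeros(K)  # reward sums per arm

    def select(self, t):
        if (self.n == 0).any():                 # play each arm once first
            return int(np.argmax(self.n == 0))
        index = self.s / self.n + np.sqrt(2 * np.log(t) / self.n)
        return int(np.argmax(index))

    def update(self, a, r):
        self.n[a] += 1
        self.s[a] += r

def run_once(policy, mu_on, T, rng):
    """One run with unit-variance Gaussian rewards; returns pseudo-regret."""
    regret = 0.0
    for t in range(1, T + 1):
        a = policy.select(t)
        policy.update(a, rng.normal(mu_on[a], 1.0))
        regret += mu_on.max() - mu_on[a]
    return regret

rng = np.random.default_rng(0)
mu_on = np.array([0.8, 0.5, 0.5, 0.5])          # K = 4; optimal arm is arm 1
regrets = [run_once(PureUCB(4), mu_on, 10_000, rng) for _ in range(100)]
# Report the average and standard error over the 100 runs, as in the paper.
print(f"regret: {np.mean(regrets):.1f} +/- {np.std(regrets) / 10:.1f}")
```
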
Researcher Affiliation: Academia
(1) Department of Industrial Systems Engineering and Management, National University of Singapore, Singapore; (2) Institute of Operations Research and Analytics, National University of Singapore, Singapore.

Pseudocode: Yes
Algorithm 1: Policy MIN-UCB.

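The section names Algorithm 1 but does not reproduce it. The following is a hedged sketch of a MIN-UCB-style index, assuming the construction suggested by the benchmarks above: take the minimum of the online-only UCB and the offline-informed UCB^S_t(a), the latter widened by the bias bound V(a). The Hoeffding-style confidence radii are illustrative, not the paper's exact constants.

```python
import numpy as np

def min_ucb_index(mu_on, n_on, mu_off, n_off, V, t):
    """Sketch of a MIN-UCB-style index for one arm a at round t.

    Returns min{UCB^O_t(a), UCB^S_t(a)}: the online-only UCB, and an
    offline-informed UCB on pooled data, inflated by the bias bound V(a).
    """
    ucb_online = mu_on + np.sqrt(2 * np.log(t) / max(n_on, 1))
    n_all = max(n_on + n_off, 1)
    mu_pooled = (n_on * mu_on + n_off * mu_off) / n_all   # pooled empirical mean
    ucb_offline = mu_pooled + np.sqrt(2 * np.log(t) / n_all) + V  # widen by bias bound
    return min(ucb_online, ucb_offline)

# The policy plays the arm maximizing this index in each round.
```
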
Open Source Code: No
The paper does not provide an explicit statement or link indicating the release of open-source code for the described methodology.

Open Datasets: No
The paper simulates data from Gaussian distributions with specified means for its numerical experiments, rather than using a pre-existing publicly available dataset with a concrete access link or citation.

Dataset Splits: No
The paper simulates a multi-armed bandit problem over T time steps; it does not specify training, validation, or test splits in the conventional sense used for fixed datasets.

Hardware Specification: No
The paper reports numerical experiments but provides no details about the hardware used (e.g., CPU, GPU, memory, or cloud instances).

Software Dependencies: No
The paper does not list software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) used for the implementation or experiments.

Experiment Setup: Yes
In all experiments, we simulate our algorithm MIN-UCB and the benchmarks on several instances, with fixed K = 4. Each arm's offline and online rewards follow unit-variance Gaussian distributions with means µ_off(a) and µ_on(a), respectively. In each instance, we run each algorithm 100 times and report the average and standard error. For simplicity, we assume V(a) = V for every sub-optimal arm a and T_S(a) = T_S for every arm a. In all experiments, we set µ_on = (0.8, 0.5, 0.5, 0.5), hence the optimal arm is a* = 1. In Figure 1, we set µ_off = (0.4, 0.6, 0.6, 0.6) and V(1) = 0.4, and vary V from 0.1 to 0.6. In Figure 2, we set µ_off = (0.5, 0.6, 0.6, 0.6) and V(1) = 0.3, and vary T_S from 1000 to 3000, for V = 0.1 and V = 0.3 respectively.
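These instances can be written down directly. The sketch below encodes the quoted parameters; the sweep step sizes (0.1 for V, 500 for T_S) and the variable names are assumptions, since the section gives only the ranges.

```python
import numpy as np

mu_on = np.array([0.8, 0.5, 0.5, 0.5])  # fixed across all experiments; a* = 1

# Figure 1: sweep the common bias bound V of the sub-optimal arms.
fig1 = dict(
    mu_off=np.array([0.4, 0.6, 0.6, 0.6]),
    V_optimal=0.4,                                  # V(1), bias bound of arm 1
    V_grid=np.round(np.arange(0.1, 0.7, 0.1), 1),   # 0.1, ..., 0.6 (step assumed)
)

# Figure 2: sweep the per-arm offline sample size T_S, for two bias levels.
fig2 = dict(
    mu_off=np.array([0.5, 0.6, 0.6, 0.6]),
    V_optimal=0.3,                                  # V(1)
    TS_grid=np.arange(1000, 3001, 500),             # 1000, ..., 3000 (step assumed)
    V_levels=(0.1, 0.3),
)
```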