Bandits with Delayed, Aggregated Anonymous Feedback

Authors: Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvári, Steffen Grünewälder

ICML 2018

Reproducibility
Variable | Result | LLM Response
Research Type | Experimental | We compared the performance of our algorithm (under different assumptions) to QPM-D (Joulani et al., 2013) in various experimental settings. In these experiments, our aim was to investigate the effect of the delay on the performance of the algorithms. In order to focus on this, we used a simple setup of two arms with Bernoulli rewards and μ = (0.5, 0.6). In every experiment, we ran each algorithm to horizon T = 250000 and used UCB1 (Auer et al., 2002) as the base algorithm in QPM-D. The regret was averaged over 200 replications. For ease of reading, we define ODAAF to be our algorithm using only knowledge of the expected delay, with n_m defined as in (2) and run without a bridge period, and ODAAF-B and ODAAF-V to be the versions of Algorithm 1 that use a bridge period and information on the bounded support and the finite variance of the delay to define n_m as in (6) and (7) respectively.
Researcher Affiliation | Collaboration | Ciara Pike-Burke (1), Shipra Agrawal (2), Csaba Szepesvári (3, 4), Steffen Grünewälder (1). (1) Department of Mathematics and Statistics, Lancaster University, Lancaster, UK; (2) Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, USA; (3) DeepMind, London, UK; (4) Department of Computing Science, University of Alberta, Edmonton, AB, Canada.
Pseudocode | Yes | Algorithm 1: Optimism for Delayed, Aggregated Anonymous Feedback (ODAAF)
Open Source Code | No | The paper provides no links to source code and no explicit statement about code availability.
Open Datasets | No | The paper uses a simple synthetic setup of two arms with Bernoulli rewards and μ = (0.5, 0.6), not a publicly available dataset with concrete access information.
Dataset Splits | No | The paper describes a bandit problem with sequential interaction up to a horizon T, not a supervised learning setup with distinct training, validation, and test splits of a fixed dataset. It defines the environment parameters but specifies no dataset splits in the traditional sense.
Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU or GPU models, memory, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper mentions using UCB1 as a base algorithm, but lists no specific software versions or dependencies.
Experiment Setup | Yes | We used a simple setup of two arms with Bernoulli rewards and μ = (0.5, 0.6). In every experiment, we ran each algorithm to horizon T = 250000 and used UCB1 (Auer et al., 2002) as the base algorithm in QPM-D. For ease of reading, we define ODAAF to be our algorithm using only knowledge of the expected delay, with n_m defined as in (2) and run without a bridge period, and ODAAF-B and ODAAF-V to be the versions of Algorithm 1 that use a bridge period and information on the bounded support and the finite variance of the delay to define n_m as in (6) and (7) respectively.
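The reported environment (two Bernoulli arms with μ = (0.5, 0.6), a fixed horizon, and regret averaged over replications) can be sketched as a minimal simulation. The sketch below uses the standard UCB1 index (Auer et al., 2002) as the policy, since the paper's own ODAAF variants are not fully specified here; the function names, the random-seed handling, and the smaller horizon in the usage note are illustrative assumptions, not details from the paper.

```python
import numpy as np

def ucb1_regret(means, horizon, rng):
    """Run UCB1 on Bernoulli arms; return the (pseudo-)regret at the horizon."""
    k = len(means)
    counts = np.zeros(k, dtype=int)   # pulls per arm
    sums = np.zeros(k)                # cumulative reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1               # initialise: play each arm once
        else:
            ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            arm = int(np.argmax(ucb))
        reward = float(rng.random() < means[arm])  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]   # expected shortfall of the chosen arm
    return regret

def average_regret(means=(0.5, 0.6), horizon=250_000, reps=200, seed=0):
    """Average regret over independent replications, as in the reported setup."""
    rng = np.random.default_rng(seed)
    return float(np.mean([ucb1_regret(means, horizon, rng) for _ in range(reps)]))
```

With the paper's parameters this would be `average_regret(means=(0.5, 0.6), horizon=250_000, reps=200)`; a shorter horizon and fewer replications give a quick sanity check.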