Bandits with Delayed, Aggregated Anonymous Feedback
Authors: Ciara Pike-Burke, Shipra Agrawal, Csaba Szepesvári, Steffen Grünewälder
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compared the performance of our algorithm (under different assumptions) to QPM-D (Joulani et al., 2013) in various experimental settings. In these experiments, our aim was to investigate the effect of the delay on the performance of the algorithms. In order to focus on this, we used a simple setup of two arms with Bernoulli rewards and μ = (0.5, 0.6). In every experiment, we ran each algorithm to horizon T = 250000 and used UCB1 (Auer et al., 2002) as the base algorithm in QPM-D. The regret was averaged over 200 replications. For ease of reading, we define ODAAF to be our algorithm using only knowledge of the expected delay, with n_m defined as in (2) and run without a bridge period, and ODAAF-B and ODAAF-V to be the versions of Algorithm 1 that use a bridge period and information on the bounded support and the finite variance of the delay to define n_m as in (6) and (7) respectively. (A minimal simulation sketch of this setup follows the table.) |
| Researcher Affiliation | Collaboration | Ciara Pike-Burke¹, Shipra Agrawal², Csaba Szepesvári³,⁴, Steffen Grünewälder¹. ¹ Department of Mathematics and Statistics, Lancaster University, Lancaster, UK; ² Department of Industrial Engineering and Operations Research, Columbia University, New York, NY, USA; ³ DeepMind, London, UK; ⁴ Department of Computing Science, University of Alberta, Edmonton, AB, Canada. |
| Pseudocode | Yes | Algorithm 1: Optimism for Delayed, Aggregated Anonymous Feedback (ODAAF). (A schematic sketch of its phase structure follows the table.) |
| Open Source Code | No | The paper does not provide any links to source code or explicit statements about code availability. |
| Open Datasets | No | The paper uses a simple synthetic setup of two arms with Bernoulli rewards and μ = (0.5, 0.6), which is not a publicly available dataset with concrete access information. |
| Dataset Splits | No | The paper describes a bandit problem with sequential interaction up to a horizon T, not a typical supervised learning setup with distinct training, validation, and test splits for a fixed dataset. It defines the environment parameters but does not specify dataset splits in the traditional sense. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., CPU, GPU models, memory, or cloud instances) used for running the experiments. |
| Software Dependencies | No | The paper mentions using UCB1 as a base algorithm, but no specific software versions or dependencies are listed. |
| Experiment Setup | Yes | We used a simple setup of two arms with Bernoulli rewards and μ = (0.5, 0.6). In every experiment, we ran each algorithm to horizon T = 250000 and used UCB1 (Auer et al., 2002) as the base algorithm in QPM-D. For ease of reading, we define ODAAF to be our algorithm using only knowledge of the expected delay, with n_m defined as in (2) and run without a bridge period, and ODAAF-B and ODAAF-V to be the versions of Algorithm 1 that use a bridge period and information on the bounded support and the finite variance of the delay to define n_m as in (6) and (7) respectively. |
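
The Research Type and Experiment Setup rows describe a setup small enough to reconstruct. The sketch below is a hypothetical Python illustration of the feedback model the paper studies: each pull's reward surfaces after a random delay, and at every step the player observes only the sum of rewards arriving at that step, with no arm labels. The geometric delay with mean 50 and the naive attribution of each step's aggregate to the currently pulled arm are illustrative assumptions; this is plain UCB1 on the raw stream, not the paper's QPM-D or ODAAF.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 2, 250_000            # two arms, horizon from the paper
mu = np.array([0.5, 0.6])    # Bernoulli means from the paper
MEAN_DELAY = 50              # illustrative assumption, not from the paper

def run_ucb1():
    """Plain UCB1 under delayed, aggregated anonymous feedback (naive sketch)."""
    pending = np.zeros(T + 1)    # reward mass scheduled to arrive at each step
    pulls = np.zeros(K)
    credit = np.zeros(K)         # aggregated feedback credited to the pulled arm
    regret = 0.0
    for t in range(T):
        if t < K:
            a = t                # pull each arm once to initialize
        else:
            ucb = credit / pulls + np.sqrt(2.0 * np.log(t) / pulls)
            a = int(np.argmax(ucb))
        r = rng.binomial(1, mu[a])
        d = min(int(rng.geometric(1.0 / MEAN_DELAY)), T - t)  # truncate at horizon
        pending[t + d] += r      # reward surfaces d steps later, without an arm label
        credit[a] += pending[t]  # naive heuristic: attribute today's sum to arm a
        pulls[a] += 1
        regret += mu.max() - mu[a]
    return regret

print("pseudo-regret over one replication:", run_ucb1())
```

Averaging `run_ucb1()` over 200 independent replications with fresh seeds would mirror the paper's reporting of mean regret.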
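For the Pseudocode row: Algorithm 1 (ODAAF) is described in the paper as a phase-based elimination algorithm, with each surviving arm played a prescribed number of times per phase and an optional bridge period. The schematic below captures only that structure; `n_m(m)`, `conf_width(m)`, and `bridge_len` are placeholders standing in for the paper's equations (2), (6), (7) and its bridge period, and the block-average elimination rule here is an assumption, not the paper's exact statistic.

```python
def odaaf_sketch(env_pull, K, T, n_m, conf_width, bridge_len=0):
    """Schematic of ODAAF's phased elimination; constants are placeholders.

    env_pull(arm) plays `arm` once and returns that step's aggregated
    anonymous feedback. n_m(m) stands in for the paper's equations
    (2)/(6)/(7); conf_width(m) for the matching confidence radius.
    """
    active = list(range(K))
    means = {a: 0.0 for a in active}
    t, m = 0, 1
    while t < T and len(active) > 1:
        for a in active:
            total, pulls = 0.0, 0
            while pulls < n_m(m) and t < T:   # play arm a for its block
                total += env_pull(a)
                t, pulls = t + 1, pulls + 1
            for _ in range(bridge_len):       # optional bridge period: the
                if t >= T:                    # delayed tail of a's rewards
                    break                     # arrives while a is still played
                env_pull(a); t += 1
            means[a] = total / max(pulls, 1)
        best = max(means[a] for a in active)  # drop clearly worse arms
        active = [a for a in active if means[a] >= best - conf_width(m)]
        m += 1
    while t < T:                              # commit to the survivor
        env_pull(active[0]); t += 1
```

A typical illustrative schedule would be `n_m=lambda m: 4**m` with `conf_width=lambda m: 2.0**-m`; the paper's actual choices depend on which delay information (expectation, bounded support, or finite variance) is available.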