Online Multi-Armed Bandits with Adaptive Inference
Authors: Maria Dimakopoulou, Zhimei Ren, Zhengyuan Zhou
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through 20 synthetic domain experiments and a semi-synthetic experiment based on data from an A/B test run at Netflix, we demonstrate that using an adaptive inferential scheme (while still retaining the exploration efficacy of TS) provides clear benefits in online decision making: the proposed DATS algorithm has superior empirical performance to existing baselines (UCB and TS) in terms of regret and sample complexity in identifying the best arm. |
| Researcher Affiliation | Collaboration | Maria Dimakopoulou (Netflix, mdimakopoulou@netflix.com); Zhimei Ren (University of Chicago, zmren@statistics.uchicago.edu); Zhengyuan Zhou (NYU Stern School of Business, zzhou@stern.nyu.edu) |
| Pseudocode | Yes | Algorithm 1 Doubly-Adaptive Thompson Sampling |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes synthetic domains where true mean rewards are drawn randomly from normal distributions and rewards are generated from normal distributions, and a semi-synthetic experiment using data from an A/B test run at Netflix to simulate a multi-armed bandit domain. While these allow reproducibility of the data generation process, no pre-existing publicly available dataset with a specific link or citation is provided. |
| Dataset Splits | No | The paper does not specify traditional training, validation, or test dataset splits. The experiments are conducted in an online decision-making setting over a horizon T, where data is collected sequentially rather than partitioned from a fixed dataset. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers). |
| Experiment Setup | Yes | For TS-Normal [Lattimore and Szepesvári, 2020], described in section 2.1, we select a weak prior μ̂_{0,a} = 0, σ̂²_{0,a} = 10^6 for all arms a ∈ A and known noise variance σ = 1. For UCB-Normal [Auer et al., 2002], described in section 2.1, we tune β among the values 1, 1.5, 2, 2.5, 3, 4 and select the one with the best performance in each domain. For DATS, we select γ = 0.01, κ = 1. Finally, we also evaluate a simplified version of DATS, called DATS-clipping: instead of the arm elimination and uniform exploration among the non-eliminated arms that DATS introduces to control diminishing propensity scores, DATS-clipping applies TS to a modified ADR estimator which replaces π_{s,a} in Γ_{s,a} and in the adaptive weights w_{s,a} by the clipped propensity max(δ, π_{s,a}). Propensity score clipping is a common practice in the offline evaluation literature [Crump et al., 2009], but this modified estimator is no longer unbiased. DATS-clipping uses δ = 0.001, κ = 1. (Illustrative code sketches of this setup follow the table.) |
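A minimal Python sketch of the baseline setup quoted above, assuming nothing beyond the reported hyperparameters: TS-Normal with a weak Normal prior (μ̂_{0,a} = 0, σ̂²_{0,a} = 10^6, known noise σ = 1), UCB-Normal with a tunable β, and a synthetic domain with normally distributed true means and Gaussian reward noise. The class names, the exact UCB index form, and the simulation loop are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

class NormalTS:
    """Thompson Sampling with a Normal prior and known reward noise variance."""
    def __init__(self, n_arms, mu0=0.0, var0=1e6, sigma=1.0):
        self.mu = np.full(n_arms, mu0)     # weak prior mean, mu_{0,a} = 0
        self.var = np.full(n_arms, var0)   # weak prior variance, sigma^2_{0,a} = 10^6
        self.noise_var = sigma ** 2        # known noise variance, sigma = 1

    def select(self):
        # One posterior draw per arm; play the argmax.
        return int(np.argmax(rng.normal(self.mu, np.sqrt(self.var))))

    def update(self, arm, reward):
        # Conjugate Normal-Normal posterior update for the played arm.
        prior_prec = 1.0 / self.var[arm]
        post_prec = prior_prec + 1.0 / self.noise_var
        self.mu[arm] = (prior_prec * self.mu[arm] + reward / self.noise_var) / post_prec
        self.var[arm] = 1.0 / post_prec

class UCBNormal:
    """UCB with an exploration coefficient beta (tuned over 1 to 4 in the paper)."""
    def __init__(self, n_arms, beta=2.0):
        self.beta = beta
        self.counts = np.zeros(n_arms)
        self.means = np.zeros(n_arms)

    def select(self, t):
        if np.any(self.counts == 0):              # play every arm once first
            return int(np.argmin(self.counts))
        bonus = self.beta * np.sqrt(np.log(t + 1) / self.counts)  # assumed index form
        return int(np.argmax(self.means + bonus))

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

# Synthetic domain as described in the Open Datasets row: true mean rewards drawn
# from a normal distribution, rewards generated with Gaussian noise (sigma = 1).
# The number of arms, horizon, and prior scale here are assumptions.
n_arms, horizon = 5, 1000
true_means = rng.normal(0.0, 1.0, n_arms)
agent = NormalTS(n_arms)
for t in range(horizon):
    arm = agent.select()
    reward = rng.normal(true_means[arm], 1.0)
    agent.update(arm, reward)
```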
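The DATS-clipping variant is described as replacing π_{s,a} with max(δ, π_{s,a}) wherever the propensity enters the estimator, with δ = 0.001. Below is a hedged sketch of that clipping applied to a generic doubly robust (AIPW-style) score; the paper's actual ADR estimator and adaptive weights w_{s,a} have additional structure not reproduced here, and the function names are hypothetical.

```python
import numpy as np

def clipped_propensity(pi_sa, delta=1e-3):
    """DATS-clipping replaces pi_{s,a} by max(delta, pi_{s,a}), with delta = 0.001."""
    return np.maximum(delta, pi_sa)

def doubly_robust_score(reward, model_estimate, is_chosen, pi_sa, delta=1e-3):
    """Generic AIPW-style score for one (time s, arm a) pair using the clipped
    propensity; `is_chosen` is 1 if arm a was actually played at time s, else 0.
    Clipping bounds the inverse-propensity term, but (as the paper notes) the
    resulting estimator is no longer unbiased."""
    pi = clipped_propensity(pi_sa, delta)
    return model_estimate + (is_chosen / pi) * (reward - model_estimate)

# Example: an arm with a tiny assignment probability gets a bounded weight.
print(doubly_robust_score(reward=0.7, model_estimate=0.5, is_chosen=1, pi_sa=1e-5))
```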