Online Multi-Armed Bandits with Adaptive Inference

Authors: Maria Dimakopoulou, Zhimei Ren, Zhengyuan Zhou

NeurIPS 2021

Reproducibility Variable Result LLM Response
Research Type Experimental Through 20 synthetic domain experiments and a semi-synthetic experiment based on data from an A/B test run at Netflix, we demonstrate that using an adaptive inferential scheme (while still retaining the exploration efficacy of TS) provides clear benefits in online decision making: the proposed DATS algorithm has superior empirical performance to existing baselines (UCB and TS) in terms of regret and sample complexity in identifying the best arm.
Researcher Affiliation Collaboration Maria Dimakopoulou (Netflix, mdimakopoulou@netflix.com); Zhimei Ren (University of Chicago, zmren@statistics.uchicago.edu); Zhengyuan Zhou (NYU Stern School of Business, zzhou@stern.nyu.edu)
Pseudocode Yes Algorithm 1 Doubly-Adaptive Thompson Sampling
Open Source Code No The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets No The paper describes synthetic domains where true mean rewards are drawn randomly from normal distributions and rewards are generated from normal distributions, and a semi-synthetic experiment using data from an A/B test run at Netflix to simulate a multi-armed bandit domain. While these allow the data-generation process to be reproduced, no pre-existing publicly available dataset with a specific link or citation is provided.
Dataset Splits No The paper does not specify traditional training, validation, or test dataset splits. The experiments are conducted in an online decision-making setting over a horizon T, where data is collected sequentially rather than partitioned from a fixed dataset.
Hardware Specification No The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies No The paper does not list specific software dependencies with version numbers (e.g., programming languages, libraries, frameworks, or solvers).
Experiment Setup Yes For TS-Normal [Lattimore and Szepesvári, 2020], described in section 2.1, we select a weak prior µ̂_{0,a} = 0, σ̂²_{0,a} = 10^6 for all arms a ∈ A and known noise variance σ = 1. For UCB-Normal [Auer et al., 2002], described in section 2.1, we tune β among the values {1, 1.5, 2, 2.5, 3, 4} and select the one with the best performance in each domain. For DATS, we select γ = 0.01, κ = 1. Finally, we also evaluate a simplified version of DATS, called DATS-clipping: instead of the arm elimination and uniform exploration among the non-eliminated arms that DATS introduces to control diminishing propensity scores, DATS-clipping applies TS to a modified ADR estimator which replaces π_{s,a} in Γ_{s,a} and in the adaptive weights w_{s,a} by the clipped propensity max(δ, π_{s,a}). Propensity score clipping is a common practice in the offline evaluation literature [Crump et al., 2009], but this modified estimator is no longer unbiased. DATS-clipping uses δ = 0.001, κ = 1.
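The baseline configuration above can be sketched in code. This is a minimal illustration, not the paper's implementation: it shows the TS-Normal posterior update under the stated weak prior (µ̂_{0,a} = 0, σ̂²_{0,a} = 10^6) with known noise variance σ = 1, plus the propensity-clipping rule max(δ, π_{s,a}) used by DATS-clipping. The toy bandit loop, arm means, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hyperparameters as stated in the experiment setup.
PRIOR_MEAN, PRIOR_VAR = 0.0, 1e6   # weak prior for TS-Normal
NOISE_VAR = 1.0                    # known reward noise variance (sigma = 1)
DELTA = 0.001                      # clipping floor delta used by DATS-clipping

def ts_normal_select(counts, sums):
    """One TS-Normal step: sample each arm's posterior mean, pick the argmax.

    With known noise variance and a normal prior, each arm's posterior over
    its mean reward is normal with the precision-weighted parameters below.
    """
    post_prec = 1.0 / PRIOR_VAR + counts / NOISE_VAR
    post_mean = (PRIOR_MEAN / PRIOR_VAR + sums / NOISE_VAR) / post_prec
    samples = rng.normal(post_mean, np.sqrt(1.0 / post_prec))
    return int(np.argmax(samples))

def clip_propensity(pi, delta=DELTA):
    """Propensity clipping as in DATS-clipping: replace pi with max(delta, pi)."""
    return np.maximum(delta, pi)

# Toy run on a 3-armed bandit (arm means are illustrative, not from the paper).
true_means = np.array([0.1, 0.5, 0.9])
counts = np.zeros(3)
sums = np.zeros(3)
for _ in range(2000):
    a = ts_normal_select(counts, sums)
    r = rng.normal(true_means[a], np.sqrt(NOISE_VAR))
    counts[a] += 1
    sums[a] += r
```

With a gap of 0.4 between the top two arms and unit noise, Thompson sampling concentrates its pulls on the best arm well within 2000 rounds, which is the behavior the regret comparisons in the paper measure.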