Only Pay for What Is Uncertain: Variance-Adaptive Thompson Sampling

Authors: Aadirupa Saha, Branislav Kveton

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we corroborate our theory with experiments, which demonstrate the benefit of our variance-adaptive Bayesian algorithm over prior frequentist works. We also show that our approach is robust to model misspecification and can be applied with estimated priors.
Researcher Affiliation | Industry | Aadirupa Saha¹, Branislav Kveton² (¹Apple, ²Amazon)
Pseudocode | Yes | Our algorithm is presented in Algorithm 1 (Appendix A) and we call it Gaussian TS... The complete pseudocode of Algorithm 1 is given in Appendix A. Our algorithm is presented in Algorithm 2 and we call it Var TS... The complete pseudocode of Algorithm 2 is given in Appendix C. (A hedged sketch of this kind of posterior sampling appears below the table.)
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper describes synthetic datasets generated from specific distributions (e.g., Bernoulli, Beta, and Gaussian bandits with given parameters) for its experiments, but it does not provide concrete access information (link, DOI, repository, or citation to an established public dataset).
Dataset Splits | No | The paper describes an online learning setting for bandit problems, which involves sequential interaction and reward observation rather than predefined dataset splits. Therefore, it does not provide explicit training/validation/test splits.
Hardware Specification | No | The paper does not specify any hardware details such as GPU or CPU models used for running the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | Yes | In all simulations, the horizon is n = 2000 and they are averaged over 1000 randomly initialized runs. We start with a Bernoulli bandit with K = 10 arms. The mean reward of arm i ∈ [K], µi, is sampled i.i.d. from prior Beta(i, K + 1 − i). The third experiment is with a Gaussian bandit where both the means and variances of rewards are sampled i.i.d. from a prior with parameters µ0,i = i/(K + 1), κ0,i = K, α0,i = 4, and β0,i = 1. For a given Bayesian bandit, let µ̂i and v̂i be the estimated mean and variance of the mean reward of arm i sampled from its prior, respectively. Moreover, let λ̂i and ν̂i be the estimated mean and variance of the precision of the reward distribution of arm i sampled from its prior, respectively. Then, using these statistics, we estimate the unknown hyper-parameters of the prior as µ0,i = µ̂i, β0,i = λ̂i/ν̂i, α0,i = β0,i·λ̂i, and κ0,i = β0,i/(α0,i·v̂i). (A sketch of this moment-matching step also follows below the table.)
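
The Pseudocode row names the paper's two algorithms, Gaussian TS and Var TS, but their pseudocode is only available in the paper's appendices. The snippet below is a minimal sketch of the kind of variance-adaptive Thompson sampling the row refers to: a K-armed Gaussian bandit with a normal-gamma prior per arm, where each round samples a precision and then a mean for every arm and pulls the arm with the highest sampled mean. The update equations are the standard normal-gamma posterior updates, and the function and argument names (var_ts, rewards_fn) are illustrative assumptions, not the paper's Algorithm 2.

```python
# Sketch of variance-adaptive Thompson sampling for a K-armed Gaussian bandit
# with unknown per-arm means and variances, using a normal-gamma conjugate
# prior. Hyper-parameter names (mu0, kappa0, alpha0, beta0) follow the quoted
# setup; the update equations are standard normal-gamma posterior updates.
import numpy as np

def var_ts(rewards_fn, mu0, kappa0, alpha0, beta0, horizon, rng):
    """One run of normal-gamma Thompson sampling.

    rewards_fn(i) draws a reward for arm i; the prior arrays have length K.
    """
    K = len(mu0)
    # Per-arm sufficient statistics: pull count, reward sum, sum of squares.
    n, s, ss = np.zeros(K), np.zeros(K), np.zeros(K)
    pulls = np.zeros(horizon, dtype=int)
    for t in range(horizon):
        # Normal-gamma posterior parameters for every arm.
        kappa = kappa0 + n
        mean = np.where(n > 0, s / np.maximum(n, 1), 0.0)
        mu = (kappa0 * mu0 + s) / kappa
        alpha = alpha0 + n / 2
        beta = (beta0 + 0.5 * (ss - n * mean ** 2)
                + 0.5 * kappa0 * n * (mean - mu0) ** 2 / kappa)
        # Sample a precision, then a mean, for every arm; pull the argmax.
        lam = rng.gamma(alpha, 1.0 / beta)
        theta = rng.normal(mu, 1.0 / np.sqrt(kappa * lam))
        i = int(np.argmax(theta))
        r = rewards_fn(i)
        n[i] += 1; s[i] += r; ss[i] += r ** 2
        pulls[t] = i
    return pulls

if __name__ == "__main__":
    # Illustrative driver: true arm parameters and prior values are arbitrary.
    rng = np.random.default_rng(0)
    K = 10
    means = rng.normal(0.5, 0.1, K)   # true arm means (illustrative)
    sds = rng.uniform(0.1, 1.0, K)    # true arm standard deviations (illustrative)
    pulls = var_ts(lambda i: rng.normal(means[i], sds[i]),
                   mu0=np.full(K, 0.5), kappa0=np.ones(K),
                   alpha0=np.full(K, 2.0), beta0=np.ones(K),
                   horizon=2000, rng=rng)
    print(np.bincount(pulls, minlength=K))  # how often each arm was pulled
```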
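
The Experiment Setup row quotes the Gaussian-bandit prior and the moment-matching formulas used to estimate hyper-parameters from prior samples. The snippet below is a hedged illustration of that step: it draws mean/precision pairs from the stated normal-gamma prior (µ0,i = i/(K + 1), κ0,i = K, α0,i = 4, β0,i = 1), then plugs the empirical moments into the quoted formulas. The number of draws per arm (10,000) is an illustrative assumption, not a value reported in the paper.

```python
# Sketch of the quoted moment-matching prior estimation for the Gaussian bandit.
import numpy as np

rng = np.random.default_rng(0)
K = 10
i = np.arange(1, K + 1)
mu0 = i / (K + 1)
kappa0 = np.full(K, float(K))
alpha0 = np.full(K, 4.0)
beta0 = np.ones(K)

# Draw (mean, precision) pairs from the normal-gamma prior of every arm.
m = 10_000  # illustrative number of prior samples per arm
lam = rng.gamma(alpha0, 1.0 / beta0, size=(m, K))     # reward precisions
mu = rng.normal(mu0, 1.0 / np.sqrt(kappa0 * lam))     # mean rewards

# Empirical means and variances of the sampled mean rewards and precisions.
mu_hat, v_hat = mu.mean(axis=0), mu.var(axis=0)
lam_hat, nu_hat = lam.mean(axis=0), lam.var(axis=0)

# Moment-matched hyper-parameters, following the formulas quoted above.
mu0_est = mu_hat
beta0_est = lam_hat / nu_hat
alpha0_est = beta0_est * lam_hat
kappa0_est = beta0_est / (alpha0_est * v_hat)
```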