Reward-Biased Maximum Likelihood Estimation for Linear Stochastic Bandits

Authors: Yu-Heng Hung, Ping-Chun Hsieh, Xi Liu, P. R. Kumar (pp. 7874-7882)

AAAI 2021

Reproducibility Variable Result LLM Response
Research Type Experimental To evaluate the performance of the proposed RBMLE methods, we conduct a comprehensive empirical comparison with other state-of-the-art methods vis-a-vis three aspects: effectiveness (cumulative regret), efficiency (computation time per decision vs. cumulative regret), and scalability (in number of arms and dimension of contexts). We paid particular attention to fairness of comparison and reproducibility of results. To ensure sample-path sameness for all methods, we compared each method over a pre-prepared dataset containing the context of each arm and the outcomes of pulling each arm over all rounds.
Researcher Affiliation Academia Yu-Heng Hung1,2, Ping-Chun Hsieh1,2, Xi Liu3, P. R. Kumar3 1 Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan 2 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan 3 Department of Electrical and Computer Engineering, Texas A&M University, College Station, Texas, USA
Pseudocode Yes Algorithm 1: LinRBMLE
1: Input: α(t), λ
2: Initialization: V_1 ← λI
3: for t = 1, 2, ... do
4:   Observe the contexts {x_{t,a}} for all the arms
5:   Select the action a_t = argmax_a  θ̂_t^⊤ x_{t,a} + (1 / 2α(t)) ‖x_{t,a}‖²_{V_t^{-1}}  and obtain r_t
6:   Update V_{t+1} ← V_t + x_{t,a_t} x_{t,a_t}^⊤
7: end for
Open Source Code Yes To ensure the reproducibility of experimental results, we set up the seeds for the random number generators at the beginning of each experiment and provide all the codes.
Open Datasets No The paper does not use an open dataset; it describes generating a synthetic dataset.
Dataset Splits No The paper does not specify training/validation/test dataset splits. It describes generating a synthetic dataset for an online learning problem (bandits), where performance is measured by cumulative regret over a time horizon, rather than traditional dataset splits.
Hardware Specification No The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies No The paper does not provide specific software dependencies with version numbers.
Experiment Setup Yes For Lin RBMLE, as suggested by Theorem 1, we choose α(t) = t without any hyper-parameter tuning, and λ = 1, which is a common choice in ridge regression and to which the empirical regret is not sensitive. We take α = 1 in Lin UCB and δ = 10^-5 in GPUCB. We tune the parameter c in GPUCBT for each experiment and choose c = 0.9, which achieves the best performance. We follow the suggestion of (Kaufmann, Cappé, and Garivier 2012) to choose c = 0 for BUCB. Respecting the restrictions in (Agrawal and Goyal 2013), we take δ = 0.5 and ϵ = 0.9 in Lin TS. In the comparison with IDS and VIDS, we sampled 10^3 points over the interval [0, 1] for q and take M = 10^4 in sampling (Algorithms 4 and 6 in (Russo and Van Roy 2018)). In the Bayesian family of benchmark methods (Lin TS, BUCB, KG, KG*, GPUCB, GPUCBT, and VIDS), the prior distribution over the unknown parameter θ is N(0_d, I_d). The comparison contains 50 trials of experiments and T rounds in each trial.
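The pseudocode and experiment setup above can be sketched as a short NumPy implementation. This is a minimal, hedged reading of Algorithm 1, not the authors' code: it assumes the pre-generated dataset layout described in the review (per-round contexts and per-arm outcomes, so all policies see the same sample path), the bias schedule α(t) = t and ridge parameter λ = 1 from the setup, and a standard ridge-regression estimate θ̂_t; the function name `lin_rbmle` and the array shapes are illustrative choices.

```python
import numpy as np

def lin_rbmle(contexts, rewards, alpha=lambda t: t, lam=1.0):
    """Sketch of the LinRBMLE index policy (Algorithm 1, assumed reading).

    contexts: array (T, K, d) -- pre-generated context of each arm per round
    rewards:  array (T, K)    -- pre-generated outcome of pulling each arm
    alpha:    reward-bias schedule; the paper suggests alpha(t) = t
    lam:      ridge parameter; the paper uses lam = 1
    Returns the index of the arm chosen in each round.
    """
    T, K, d = contexts.shape
    V = lam * np.eye(d)               # V_1 = lambda * I
    b = np.zeros(d)                   # running sum of r_s * x_{s, a_s}
    chosen = np.zeros(T, dtype=int)
    for t in range(T):
        V_inv = np.linalg.inv(V)
        theta_hat = V_inv @ b         # ridge-regression estimate of theta
        X = contexts[t]               # contexts of the K arms, shape (K, d)
        # RBMLE index: theta_hat^T x + ||x||^2_{V^{-1}} / (2 * alpha(t))
        quad = np.einsum('kd,de,ke->k', X, V_inv, X)
        index = X @ theta_hat + quad / (2.0 * alpha(t + 1))
        a = int(np.argmax(index))
        chosen[t] = a
        x = X[a]
        V += np.outer(x, x)           # V_{t+1} = V_t + x x^T
        b += rewards[t, a] * x
    return chosen
```

Seeding the generator before building `contexts` and `rewards` (e.g. `np.random.default_rng(seed)`) reproduces the sample-path sameness the review describes: every compared method replays the same pre-prepared contexts and outcomes.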