Thompson Sampling with Information Relaxation Penalties

Authors: Seungki Min, Costis Maglaras, Ciamac C. Moallemi

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments suggest that this new class of policies perform well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem.
Researcher Affiliation | Academia | Seungki Min (Columbia Business School), Costis Maglaras (Columbia Business School), Ciamac C. Moallemi (Columbia Business School)
Pseudocode | Yes | Algorithm 1: Information relaxation sampling (IRS) policy (see the sketch after this table)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | The paper describes generating data for simulation rather than using a publicly available dataset with concrete access information: 'In the simulation, we randomly generate a set of outcomes ω^(1), . . . , ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000).'
Dataset Splits | No | The paper generates S = 20,000 simulated outcomes for evaluation, but does not specify dataset splits (e.g., train/validation/test percentages or counts); the policies are not 'trained' on data splits in the traditional machine learning sense.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | We visualize the effectiveness of IRS policies and performance bounds in case of Gaussian MAB with five arms (K = 5) with different noise variances. More specifically, each arm a ∈ A has the unknown mean reward θ_a ∼ N(0, 1^2) and yields the stochastic rewards R_{a,n} ∼ N(θ_a, σ_a^2), where σ_1 = 0.1, σ_2 = 0.4, σ_3 = 1.0, σ_4 = 4.0 and σ_5 = 10.0. Our experiment includes the state-of-the-art algorithms that are particularly suitable in a Bayesian framework: Bayesian upper confidence bound [14] (BAYES-UCB, with a quantile of 1 − 1/t), information directed sampling [20] (IDS), and optimistic Gittins index [9] (OGI, one-step look ahead approximation with a discount factor γ_t = 1 − 1/t). In the simulation, we randomly generate a set of outcomes ω^(1), . . . , ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000).
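
The Experiment Setup row fully specifies the simulated Gaussian MAB instance, and the Pseudocode row points to Algorithm 1 (the IRS policy family). The sketch below is a minimal, hedged illustration of that setup, not a reproduction of the paper's code: it simulates the five-arm Gaussian bandit with the quoted noise levels, runs Thompson sampling (which the paper treats as a special case of the IRS family), and estimates the policy value V(π, T, y) by sample average approximation as in the quoted protocol. The horizon T = 100, the reduced sample count S = 2,000, the random seed, and identifiers such as `run_episode` are assumptions made here for illustration.

```python
"""Hedged sketch: Gaussian MAB from the Experiment Setup row, with
Thompson sampling evaluated by sample average approximation."""
import numpy as np

# --- Problem instance from the Experiment Setup row ---
K = 5                                           # number of arms
PRIOR_MEAN, PRIOR_VAR = 0.0, 1.0                # theta_a ~ N(0, 1^2)
SIGMA = np.array([0.1, 0.4, 1.0, 4.0, 10.0])    # known noise std per arm
T = 100                                         # horizon (assumed; not in the quote)
S = 2_000                                       # Monte Carlo outcomes (paper uses 20,000)

rng = np.random.default_rng(0)


def run_episode(rng):
    """Simulate one outcome: draw true means, then run Thompson sampling for T periods."""
    theta = rng.normal(PRIOR_MEAN, np.sqrt(PRIOR_VAR), size=K)  # true mean rewards

    # Gaussian posterior over theta_a with known noise variance: mean/variance per arm.
    post_mean = np.full(K, PRIOR_MEAN)
    post_var = np.full(K, PRIOR_VAR)

    total_reward = 0.0
    for _ in range(T):
        # Thompson sampling step: sample a plausible mean vector from the posterior
        # and pull the arm it ranks best. The paper recovers this rule as a special
        # case of its IRS family; the stronger IRS variants replace this step with
        # a penalized inner problem, which is not sketched here.
        theta_hat = rng.normal(post_mean, np.sqrt(post_var))
        a = int(np.argmax(theta_hat))

        # Observe a noisy reward from the chosen arm.
        r = rng.normal(theta[a], SIGMA[a])
        total_reward += r

        # Conjugate Gaussian posterior update for arm a.
        precision = 1.0 / post_var[a] + 1.0 / SIGMA[a] ** 2
        post_mean[a] = (post_mean[a] / post_var[a] + r / SIGMA[a] ** 2) / precision
        post_var[a] = 1.0 / precision

    return total_reward


# Sample average approximation of the policy value V(pi, T, y), mirroring the
# evaluation protocol quoted above (with a smaller S for runtime).
values = [run_episode(rng) for _ in range(S)]
print(f"Estimated V(TS, T={T}): {np.mean(values):.3f} "
      f"+/- {1.96 * np.std(values) / np.sqrt(S):.3f}")
```

With the listed noise levels, arm 1 is observed almost noiselessly while arm 5 is extremely noisy, so a short horizon creates exactly the exploration-exploitation tension mentioned in the Research Type row. Reproducing the paper's full comparison would additionally require the penalized inner problems of the stronger IRS variants as well as BAYES-UCB, IDS, and OGI, none of which are sketched above.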