Thompson Sampling with Information Relaxation Penalties

Authors: Seungki Min, Costis Maglaras, Ciamac C. Moallemi

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical experiments suggest that this new class of policies perform well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem.
Researcher Affiliation | Academia | Seungki Min (Columbia Business School), Costis Maglaras (Columbia Business School), Ciamac C. Moallemi (Columbia Business School)
Pseudocode | Yes | Algorithm 1: Information relaxation sampling (IRS) policy (see the sketch after this table)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | The paper describes generating data for simulation rather than using a publicly available dataset with concrete access information: 'In the simulation, we randomly generate a set of outcomes ω^(1), . . . , ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000).'
Dataset Splits | No | The paper generates S = 20,000 simulated outcomes for evaluation, but does not specify dataset splits (e.g., train/validation/test percentages or counts); the policies are not 'trained' on data splits in the traditional machine learning sense.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers.
Experiment Setup | Yes | We visualize the effectiveness of IRS policies and performance bounds in case of Gaussian MAB with five arms (K = 5) with different noise variances. More specifically, each arm a ∈ A has the unknown mean reward θ_a ∼ N(0, 1^2) and yields the stochastic rewards R_{a,n} ∼ N(θ_a, σ_a^2), where σ_1 = 0.1, σ_2 = 0.4, σ_3 = 1.0, σ_4 = 4.0 and σ_5 = 10.0. Our experiment includes the state-of-the-art algorithms that are particularly suitable in a Bayesian framework: Bayesian upper confidence bound [14] (BAYES-UCB, with a quantile of 1 − 1/t), information directed sampling [20] (IDS), and optimistic Gittins index [9] (OGI, one-step look ahead approximation with a discount factor γ_t = 1 − 1/t). In the simulation, we randomly generate a set of outcomes ω^(1), . . . , ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000).
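
The Experiment Setup row fully specifies the simulated Gaussian MAB instance, and the Pseudocode row points to Algorithm 1 (the IRS policy family). The sketch below is a minimal, hedged illustration of that setup, not a reproduction of the paper's code: it simulates the five-arm Gaussian bandit with the quoted noise levels, runs Thompson sampling (which the paper treats as a special case of the IRS family), and estimates the policy value V(π, T, y) by sample average approximation as in the quoted protocol. The horizon T = 100, the reduced sample count S = 2,000, the random seed, and identifiers such as `run_episode` are assumptions made here for illustration.

```python
"""Hedged sketch: Gaussian MAB from the Experiment Setup row, with
Thompson sampling evaluated by sample average approximation."""
import numpy as np

# --- Problem instance from the Experiment Setup row ---
K = 5                                           # number of arms
PRIOR_MEAN, PRIOR_VAR = 0.0, 1.0                # theta_a ~ N(0, 1^2)
SIGMA = np.array([0.1, 0.4, 1.0, 4.0, 10.0])    # known noise std per arm
T = 100                                         # horizon (assumed; not in the quote)
S = 2_000                                       # Monte Carlo outcomes (paper uses 20,000)

rng = np.random.default_rng(0)


def run_episode(rng):
    """Simulate one outcome: draw true means, then run Thompson sampling for T periods."""
    theta = rng.normal(PRIOR_MEAN, np.sqrt(PRIOR_VAR), size=K)  # true mean rewards

    # Gaussian posterior over theta_a with known noise variance: mean/variance per arm.
    post_mean = np.full(K, PRIOR_MEAN)
    post_var = np.full(K, PRIOR_VAR)

    total_reward = 0.0
    for _ in range(T):
        # Thompson sampling step: sample a plausible mean vector from the posterior
        # and pull the arm it ranks best. The paper recovers this rule as a special
        # case of its IRS family; the stronger IRS variants replace this step with
        # a penalized inner problem, which is not sketched here.
        theta_hat = rng.normal(post_mean, np.sqrt(post_var))
        a = int(np.argmax(theta_hat))

        # Observe a noisy reward from the chosen arm.
        r = rng.normal(theta[a], SIGMA[a])
        total_reward += r

        # Conjugate Gaussian posterior update for arm a.
        precision = 1.0 / post_var[a] + 1.0 / SIGMA[a] ** 2
        post_mean[a] = (post_mean[a] / post_var[a] + r / SIGMA[a] ** 2) / precision
        post_var[a] = 1.0 / precision

    return total_reward


# Sample average approximation of the policy value V(pi, T, y), mirroring the
# evaluation protocol quoted above (with a smaller S for runtime).
values = [run_episode(rng) for _ in range(S)]
print(f"Estimated V(TS, T={T}): {np.mean(values):.3f} "
      f"+/- {1.96 * np.std(values) / np.sqrt(S):.3f}")
```

With the listed noise levels, arm 1 is observed almost noiselessly while arm 5 is extremely noisy, so a short horizon creates exactly the exploration-exploitation tension mentioned in the Research Type row. Reproducing the paper's full comparison would additionally require the penalized inner problems of the stronger IRS variants as well as BAYES-UCB, IDS, and OGI, none of which are sketched above.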