Thompson Sampling with Information Relaxation Penalties
Authors: Seungki Min, Costis Maglaras, Ciamac C. Moallemi
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Numerical experiments suggest that this new class of policies perform well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem. |
| Researcher Affiliation | Academia | Seungki Min (Columbia Business School), Costis Maglaras (Columbia Business School), Ciamac C. Moallemi (Columbia Business School) |
| Pseudocode | Yes | Algorithm 1: Information relaxation sampling (IRS) policy |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper describes generating data for simulation rather than using a publicly available dataset with concrete access information: 'In the simulation, we randomly generate a set of outcomes ω^(1), ..., ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000).' |
| Dataset Splits | No | The paper describes generating S=20,000 outcomes for simulation and evaluation, but does not specify dataset splits (e.g., train/validation/test percentages or counts) as would typically be done for model training and evaluation on pre-existing datasets. The policies themselves are not 'trained' on data splits in the traditional machine learning sense. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | We visualize the effectiveness of IRS policies and performance bounds in case of Gaussian MAB with five arms (K = 5) with different noise variances. More specifically, each arm a ∈ A has the unknown mean reward θ_a ~ N(0, 1²) and yields the stochastic rewards R_{a,n} ~ N(θ_a, σ_a²) where σ_1 = 0.1, σ_2 = 0.4, σ_3 = 1.0, σ_4 = 4.0 and σ_5 = 10.0. Our experiment includes the state-of-the-art algorithms that are particularly suitable in a Bayesian framework: Bayesian upper confidence bound [14] (BAYES-UCB, with a quantile of 1 − 1/t), information directed sampling [20] (IDS), and optimistic Gittins index [9] (OGI, one-step look-ahead approximation with a discount factor γ_t = 1 − 1/t). In the simulation, we randomly generate a set of outcomes ω^(1), ..., ω^(S) and measure the performance of each policy π, V(π, T, y), and the performance bounds, W^z(T, y), via sample average approximation across these sampled outcomes (S = 20,000). |
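For concreteness, the Gaussian bandit described in the Experiment Setup row (K = 5 arms, prior θ_a ~ N(0, 1²), rewards R_{a,n} ~ N(θ_a, σ_a²)) can be simulated along the following lines. This is a minimal sketch, not the authors' Algorithm 1 referenced in the Pseudocode row: it implements only vanilla Thompson sampling, which the paper situates as the zero-penalty member of the IRS family, and the names `sample_outcome` and `thompson_sampling` are illustrative rather than taken from the paper.

```python
import numpy as np

SIGMAS = np.array([0.1, 0.4, 1.0, 4.0, 10.0])   # noise std devs sigma_1..sigma_5 from the paper
PRIOR_MEAN, PRIOR_VAR = 0.0, 1.0                # prior theta_a ~ N(0, 1^2)

def sample_outcome(rng, T):
    """Draw one outcome omega: true arm means and a T-by-K table of future rewards."""
    theta = rng.normal(PRIOR_MEAN, np.sqrt(PRIOR_VAR), size=len(SIGMAS))
    rewards = rng.normal(theta, SIGMAS, size=(T, len(SIGMAS)))
    return theta, rewards

def thompson_sampling(rewards, rng):
    """Vanilla Thompson sampling (the zero-penalty member of the IRS family),
    run against one pre-sampled outcome; returns the total realized reward."""
    T, K = rewards.shape
    post_prec = np.full(K, 1.0 / PRIOR_VAR)                 # posterior precision per arm
    post_mean_x_prec = np.full(K, PRIOR_MEAN / PRIOR_VAR)   # precision-weighted posterior mean
    pulls = np.zeros(K, dtype=int)
    total = 0.0
    for _ in range(T):
        post_mean = post_mean_x_prec / post_prec
        theta_tilde = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))  # posterior draw
        a = int(np.argmax(theta_tilde))
        r = rewards[pulls[a], a]        # consume arm a's next pre-sampled reward
        pulls[a] += 1
        total += r
        # Conjugate Gaussian update with known noise variance sigma_a^2.
        post_prec[a] += 1.0 / SIGMAS[a] ** 2
        post_mean_x_prec[a] += r / SIGMAS[a] ** 2
    return total
```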
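The evaluation protocol quoted in the Open Datasets and Experiment Setup rows, estimating V(π, T, y) by a sample average over S independently generated outcomes, could then be sketched as below. `estimate_value` is an assumed helper name, and S is set well below the paper's 20,000 only to keep the example quick.

```python
def estimate_value(policy, T=100, S=2000, seed=0):
    """Sample average approximation of the Bayesian value V(policy, T, y)
    across S independently sampled outcomes (the paper uses S = 20,000)."""
    rng = np.random.default_rng(seed)
    values = [policy(sample_outcome(rng, T)[1], rng) for _ in range(S)]
    return float(np.mean(values))

if __name__ == "__main__":
    print("Thompson sampling value estimate:", estimate_value(thompson_sampling))
```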