Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Authors: Hao Qin, Kwang-Sung Jun, Chicheng Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. For Bernoulli TS's log, we approximate the action probability by Monte Carlo sampling with 1000 samples at each step. Here we estimate the expected reward of the uniform policy, which has an expected average reward of 0.85 (black dashed line). Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796; however, for half of the trials, the IPW estimator induced by Bernoulli TS's log returns invalid values due to the action probability estimates being zero. Even excluding those invalid values, Bernoulli TS's logged data induces an MSE of 0.02015. See Appendix H for additional experiments. |
| Researcher Affiliation | Academia | Hao Qin (University of Arizona), Kwang-Sung Jun (University of Arizona), Chicheng Zhang (University of Arizona) |
| Pseudocode | Yes | Algorithm 1 KL Maillard Sampling (KL-MS) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... In this paper, we study the multi-armed bandit setting where reward distributions of all arms are supported on [0, 1]. An important special case is Bernoulli bandits, where for each arm i, ν_i = Bernoulli(µ_i) for some µ_i ∈ [0, 1]. |
| Dataset Splits | No | The paper describes experiments in a simulated Bernoulli bandit environment but does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers. |
| Experiment Setup | Yes | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796. |
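The experiment the table quotes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the arm-selection rule p_a ∝ exp(−N_a · kl(µ̂_a, µ̂_max)) follows the KL Maillard Sampling idea for Bernoulli rewards, but the initialization (one forced pull per arm, logged with propensity 1) and the generic `ipw_estimate` helper are assumptions introduced here for clarity.

```python
import math
import random

def binary_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), clipped for stability."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ms_probs(counts, means):
    """Action probabilities p_a proportional to exp(-N_a * kl(mu_hat_a, mu_hat_max))."""
    mu_max = max(means)
    weights = [math.exp(-n * binary_kl(m, mu_max)) for n, m in zip(counts, means)]
    total = sum(weights)
    return [w / total for w in weights]

def run_kl_ms(true_means, horizon, rng):
    """Run a Bernoulli bandit with a KL-MS-style sampler; return (action, reward, propensity) log."""
    k = len(true_means)
    counts, sums, log = [0] * k, [0.0] * k, []
    for a in range(k):  # forced initial pull per arm (an assumption); deterministic, so propensity 1.0
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1; sums[a] += r
        log.append((a, r, 1.0))
    for _ in range(horizon - k):
        means = [s / n for s, n in zip(sums, counts)]
        probs = kl_ms_probs(counts, means)
        a = rng.choices(range(k), weights=probs)[0]
        r = 1.0 if rng.random() < true_means[a] else 0.0
        counts[a] += 1; sums[a] += r
        log.append((a, r, probs[a]))
    return log

def ipw_estimate(log, target_probs):
    """Inverse-propensity-weighted estimate of a target policy's average reward."""
    return sum(target_probs[a] / p * r for a, r, p in log) / len(log)
```

In the Figure 1 setup, `run_kl_ms([0.8, 0.9], 10_000, rng)` produces the logged data, and `ipw_estimate(log, [0.5, 0.5])` evaluates the uniform policy (true value 0.85). Because KL-MS keeps every action probability strictly positive, the propensities in the denominator never vanish, which is the property the caption contrasts with Bernoulli TS's Monte Carlo-estimated propensities.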