Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards
Authors: Hao Qin, Kwang-Sung Jun, Chicheng Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. For Bernoulli TS's log, we approximate the action probabilities by Monte Carlo sampling with 1000 samples at each step. Here we estimate the expected reward of the uniform policy, which has an expected average reward of 0.85 (black dashed line). Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796; however, for half of the trials, the IPW estimator induced by Bernoulli TS's log returns invalid values due to the action probability estimates being zero. Even excluding those invalid values, Bernoulli TS's logged data induces an MSE of 0.02015. See Appendix H for additional experiments. |
| Researcher Affiliation | Academia | Hao Qin, University of Arizona, hqin@arizona.edu; Kwang-Sung Jun, University of Arizona, kjun@cs.arizona.edu; Chicheng Zhang, University of Arizona, chichengz@cs.arizona.edu |
| Pseudocode | Yes | Algorithm 1: KL Maillard Sampling (KL-MS). (A sketch of the sampling rule appears below the table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... In this paper, we study the multi-armed bandit setting where the reward distributions of all arms are supported on [0, 1]. An important special case is Bernoulli bandits, where for each arm i, νi = Bernoulli(µi) for some µi ∈ [0, 1]. |
| Dataset Splits | No | The paper describes experiments in a simulated Bernoulli bandit environment but does not specify any training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments. |
| Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers. |
| Experiment Setup | Yes | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796. (A sketch of this offline-evaluation setup appears below the table.) |
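
The Pseudocode row refers to Algorithm 1 (KL-MS), but the card does not reproduce it. Below is a minimal Python sketch of our reading of KL-MS for Bernoulli bandits: after pulling each arm once, the algorithm samples arm a with probability proportional to exp(-N_{t,a} * kl(mu_hat_{t,a}, mu_hat_{t,max})), where kl is the binary relative entropy. Function names and the numerical clipping are our own; this is an illustration of the sampling rule as described in the paper, not the authors' code.

```python
import numpy as np

def binary_kl(p, q, eps=1e-12):
    """Binary relative entropy kl(p, q), clipped for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ms(means, T=10_000, rng=None):
    """Run a KL-MS-style policy on a Bernoulli bandit with the given arm means.

    Returns the chosen arms, observed rewards, and the action probabilities
    used at each round (the logging propensities needed for IPW evaluation).
    """
    rng = np.random.default_rng(rng)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)
    arms, rewards, probs = [], [], []
    for t in range(T):
        if t < K:                        # pull each arm once to initialize
            p = np.eye(K)[t]
        else:
            mu_hat = sums / counts
            gaps = binary_kl(mu_hat, mu_hat.max())
            logits = -counts * gaps      # p_{t,a} proportional to exp(-N_{t,a} * kl(mu_hat_a, mu_hat_max))
            logits -= logits.max()       # stabilize the softmax
            p = np.exp(logits)
            p /= p.sum()
        a = rng.choice(K, p=p)
        r = float(rng.random() < means[a])   # Bernoulli reward
        counts[a] += 1
        sums[a] += r
        arms.append(a); rewards.append(r); probs.append(p[a])
    return np.array(arms), np.array(rewards), np.array(probs)
```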
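
The Research Type and Experiment Setup rows describe an offline evaluation: logged data from a bandit algorithm is used to estimate the value of the uniform policy with an inverse-propensity-weighting (IPW) estimator, which is undefined whenever the logging propensity of the pulled arm is zero (the failure mode reported for Bernoulli TS's log). The sketch below reproduces that setup under the stated assumptions (two Bernoulli arms with means 0.8 and 0.9, T = 10,000, uniform target policy with true value 0.85); the helper names are hypothetical, not from the paper.

```python
def ipw_uniform_value(rewards, probs, n_arms=2):
    """IPW estimate of the uniform policy's value from logged bandit data.

    probs[t] is the logging policy's probability of the arm actually pulled
    at round t; a zero propensity makes the estimate undefined (returns nan).
    """
    if np.any(probs == 0):
        return float("nan")
    weights = (1.0 / n_arms) / probs     # importance weights pi(a_t) / p_t(a_t)
    return float(np.mean(weights * rewards))

# Rough reproduction of the Figure 1 setup: repeat over many trials and
# compare the estimates against the true uniform-policy value of 0.85.
# estimates = [ipw_uniform_value(*kl_ms([0.8, 0.9], T=10_000)[1:]) for _ in range(2000)]
# mse = np.mean((np.array(estimates) - 0.85) ** 2)
```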