Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Authors: Hao Qin, Kwang-Sung Jun, Chicheng Zhang

NeurIPS 2023

Reproducibility assessment. Each entry below gives the variable, the extracted result, and the supporting LLM response.
Research Type: Experimental
LLM response: "Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. For Bernoulli TS's log, we approximate the action probability by Monte Carlo sampling with 1000 samples for each step. Here we estimate the expected reward of the uniform policy, which has an expected average reward of 0.85 (black dashed line). Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796; however, for half of the trials, the IPW estimator induced by Bernoulli TS's log returns invalid values due to the action probability estimates being zero. Even excluding those invalid values, Bernoulli TS's logged data induces an MSE of 0.02015. See Appendix H for additional experiments."
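The Monte Carlo step described in the caption (estimating Bernoulli TS's action probabilities from posterior samples) can fail exactly as reported: a probability estimated from finitely many samples can come out as exactly zero. Below is a minimal sketch of that estimation step, assuming Beta(1, 1) priors and the function name ts_action_probs (both our choices, not from the paper).

```python
import numpy as np

def ts_action_probs(successes, failures, n_samples=1000, rng=None):
    """Monte Carlo estimate of Bernoulli TS action probabilities.

    Draws `n_samples` joint posterior samples (Beta(1+s, 1+f) per arm)
    and returns the fraction of draws in which each arm attains the
    largest sampled mean. An arm's estimate can be exactly 0, which is
    what invalidates the downstream IPW estimator.
    """
    rng = np.random.default_rng() if rng is None else rng
    K = len(successes)
    draws = rng.beta(1 + np.asarray(successes)[:, None],
                     1 + np.asarray(failures)[:, None],
                     size=(K, n_samples))
    winners = draws.argmax(axis=0)           # best arm in each joint draw
    return np.bincount(winners, minlength=K) / n_samples

probs = ts_action_probs(successes=[40, 45], failures=[10, 5])
```

With 1000 samples per step, any arm whose true TS probability is much below 1/1000 will often be estimated as exactly 0, matching the invalid-value failure the caption reports.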
Researcher Affiliation: Academia
LLM response: "Hao Qin, University of Arizona, hqin@arizona.edu; Kwang-Sung Jun, University of Arizona, kjun@cs.arizona.edu; Chicheng Zhang, University of Arizona, chichengz@cs.arizona.edu"
Pseudocode: Yes
LLM response: "Algorithm 1 KL Maillard Sampling (KL-MS)"
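The paper's pseudocode is Algorithm 1 itself; as an illustrative sketch only, here is one way the KL-MS sampling rule could look in Python, assuming arm probabilities proportional to exp(-N_a · kl(μ̂_a, μ̂_max)) with kl the binary (Bernoulli) KL divergence, each arm pulled once to initialize, and logged (arm, reward, probability) triples kept for offline evaluation. Details such as tie-breaking and initialization may differ from the actual Algorithm 1.

```python
import numpy as np

def binary_kl(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q) for p, q in [0, 1]."""
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ms_probs(means, counts):
    """Action probabilities p_a proportional to exp(-N_a * kl(mu_a, mu_max))."""
    mu_max = means.max()
    logits = -counts * binary_kl(means, mu_max)
    w = np.exp(logits - logits.max())   # stabilize before normalizing
    return w / w.sum()

def run_kl_ms(mu, T, rng=None):
    """Run KL-MS-style sampling on a Bernoulli bandit with means `mu`."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(mu)
    counts = np.zeros(K)
    sums = np.zeros(K)
    log = []                            # (arm, reward, action probability)
    for t in range(T):
        if t < K:                       # pull each arm once to initialize
            probs = np.eye(K)[t]
        else:
            probs = kl_ms_probs(sums / counts, counts)
        a = rng.choice(K, p=probs)
        r = float(rng.random() < mu[a])
        counts[a] += 1
        sums[a] += r
        log.append((a, r, probs[a]))
    return log

logged = run_kl_ms(mu=[0.8, 0.9], T=10_000)
```

Because every arm's probability under this rule is strictly positive at every step, the resulting log always yields valid IPW estimates, unlike the Monte Carlo approximation needed for Bernoulli TS.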
Open Source Code: No
LLM response: "The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available."
Open Datasets: No
LLM response: "Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... In this paper, we study the multi-armed bandit setting where the reward distributions of all arms are supported on [0, 1]. An important special case is Bernoulli bandits, where for each arm i, νi = Bernoulli(µi) for some µi ∈ [0, 1]."
Dataset Splits: No
LLM response: "The paper describes experiments in a simulated Bernoulli bandit environment but does not specify any training, validation, or test dataset splits."
Hardware Specification: No
LLM response: "The paper does not provide any specific details about the hardware used for running the experiments."
Software Dependencies: No
LLM response: "The paper does not provide any specific software dependencies with version numbers."
Experiment Setup: Yes
LLM response: "Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796."
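The offline evaluation in this setup is the standard inverse propensity weighting (IPW) estimate of the uniform policy's value. A minimal sketch, assuming logged (arm, reward, action probability) triples like those produced by the run_kl_ms sketch above; returning None on a zero propensity mirrors the invalid-value failure reported for Bernoulli TS's Monte Carlo log.

```python
def ipw_uniform_value(log, K):
    """IPW estimate of the uniform policy's value.

    Each logged triple is (arm, reward, logging probability); the
    uniform target policy plays every arm with probability 1/K, so the
    per-step importance weight is (1/K) / p_log. A zero logging
    probability leaves the estimator undefined, so we return None.
    """
    if any(p == 0.0 for _, _, p in log):
        return None
    return sum((1.0 / K) * r / p for _, r, p in log) / len(log)

# For means (0.8, 0.9) the uniform policy's true value is 0.85,
# the black dashed line in Figure 1.
estimate = ipw_uniform_value(logged, K=2)
```

The MSE figures quoted above (0.00796 for KL-MS versus 0.02015 for Bernoulli TS, even after discarding invalid trials) are then squared errors of such estimates against 0.85, averaged over the 2000 trials.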