Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Kullback-Leibler Maillard Sampling for Multi-armed Bandits with Bounded Rewards

Authors: Hao Qin, Kwang-Sung Jun, Chicheng Zhang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. For Bernoulli TS's log, we approximate the action probability by Monte Carlo sampling with 1000 samples per step. Here we estimate the expected reward of the uniform policy, which has an expected average reward of 0.85 (black dashed line). Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796; however, for half of the trials, the IPW estimator induced by Bernoulli TS's log returns invalid values because the action probability estimates are zero. Even excluding those invalid values, Bernoulli TS's logged data induces an MSE of 0.02015. See Appendix H for additional experiments.
Researcher Affiliation | Academia | Hao Qin (University of Arizona), Kwang-Sung Jun (University of Arizona), Chicheng Zhang (University of Arizona)
Pseudocode | Yes | Algorithm 1: KL Maillard Sampling (KL-MS)
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... In this paper, we study the multi-armed bandit setting where the reward distributions of all arms are supported on [0, 1]. An important special case is Bernoulli bandits, where for each arm i, ν_i = Bernoulli(µ_i) for some µ_i ∈ [0, 1].
Dataset Splits | No | The paper describes experiments in a simulated Bernoulli bandit environment but does not specify any training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments.
Software Dependencies | No | The paper does not provide any specific software dependencies with version numbers.
Experiment Setup | Yes | Figure 1: Histogram of the average rewards computed from the offline evaluation, where the logged data is collected from Bernoulli TS and KL-MS (Algorithm 1) in a Bernoulli bandit environment with mean rewards (0.8, 0.9) and time horizon T = 10,000. ... Across 2000 trials, the logged data of KL-MS induces an MSE of 0.00796.
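For context on the pseudocode entry: KL-MS selects each arm with probability proportional to exp(-N_i · kl(µ̂_i, µ̂_max)), where kl is the binary KL divergence between Bernoulli means. The sketch below is an illustrative reconstruction of that sampling rule, not the authors' code; all function names and the initialization scheme (one forced pull per arm) are assumptions.

```python
import numpy as np

def binary_kl(p, q, eps=1e-12):
    """Binary KL divergence kl(p, q) between Bernoulli means, clipped for stability."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ms_probs(counts, means):
    """Action probabilities p_i proportional to exp(-N_i * kl(mu_i, mu_max))."""
    mu_max = max(means)
    logits = np.array([-n * binary_kl(m, mu_max) for n, m in zip(counts, means)])
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

def run_kl_ms(true_means, horizon, rng):
    """Run KL-MS on a Bernoulli bandit; returns per-arm pull counts and reward sums."""
    k = len(true_means)
    counts = np.ones(k)  # assumed: one initial pull per arm
    sums = np.array([float(rng.random() < m) for m in true_means])
    for _ in range(horizon - k):
        probs = kl_ms_probs(counts, sums / counts)
        a = rng.choice(k, p=probs)
        counts[a] += 1
        sums[a] += float(rng.random() < true_means[a])
    return counts, sums
```

Because the sampling probabilities are explicit (unlike Thompson sampling, where they must be approximated), KL-MS's log directly records the propensities needed for offline evaluation, which is the point the Figure 1 experiment makes.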
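The Figure 1 caption mentions approximating Bernoulli TS's action probabilities by Monte Carlo sampling with 1000 samples per step. A minimal sketch of one common way to do this, assuming Beta posteriors and counting the fraction of posterior draws in which each arm attains the argmax (the specific interface is an assumption, not the authors' implementation):

```python
import numpy as np

def ts_action_probs_mc(alphas, betas, n_samples=1000, rng=None):
    """Monte Carlo estimate of Bernoulli TS action probabilities.

    alphas, betas: Beta posterior parameters per arm.
    Returns the fraction of n_samples joint posterior draws in which
    each arm has the largest sampled mean.
    """
    rng = np.random.default_rng() if rng is None else rng
    draws = rng.beta(alphas, betas, size=(n_samples, len(alphas)))
    winners = draws.argmax(axis=1)
    return np.bincount(winners, minlength=len(alphas)) / n_samples
```

Note that when one arm's posterior dominates, the estimated probability of the other arms can come out exactly zero with finitely many samples; this is the failure mode the report quotes, where the IPW estimator for Bernoulli TS's log returns invalid values.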
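The offline evaluation in Figure 1 estimates the uniform policy's average reward via inverse propensity weighting (IPW): each logged reward is reweighted by the target policy's action probability divided by the logging policy's. A hedged sketch of such an estimator, assuming logged (reward, logging-probability) pairs; the interface is illustrative, not the paper's code:

```python
def ipw_value(logged, target_prob):
    """IPW estimate of a target policy's average reward from logged bandit data.

    logged: list of (reward, p_log) pairs, where p_log is the logging
    policy's probability of the action it actually took.
    target_prob: the target policy's probability of that same action
    (a constant 1/K for the uniform policy over K arms).
    Returns None if any logged propensity is zero, mirroring the invalid
    estimates the report describes for Bernoulli TS's log.
    """
    if any(p == 0.0 for _, p in logged):
        return None
    return sum(r * target_prob / p for r, p in logged) / len(logged)
```

Usage: with a 2-arm uniform target policy, `ipw_value(logged, 0.5)` recovers the average reward exactly when the logging policy was itself uniform, and is unbiased (but higher variance) otherwise.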