Leveraging Observations in Bandits: Between Risks and Benefits

Authors: Andrei Lupu, Audrey Durand, Doina Precup
Pages: 6112-6119

AAAI 2019

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We provide empirical results showing both great benefits as well as certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application." and "6 Experiments" |
| Researcher Affiliation | Academia | Andrei Lupu, School of Computer Science, McGill University, andrei.lupu@mail.mcgill.ca; Audrey Durand, School of Computer Science, McGill University, audrey.durand@mcgill.ca; Doina Precup, School of Computer Science, McGill University, dprecup@cs.mcgill.ca |
| Pseudocode | Yes | Algorithm 1: Target-UCB for rewards in [0, 1]. |
| Open Source Code | Yes | We also provide an implementation of the Target-UCB algorithm: https://github.com/lupuandr/Target-UCB. |
| Open Datasets | Yes | The resulting dataset can be found at: https://github.com/lupuandr/Target-UCB/tree/master/Human%20bandit%20dataset. |
| Dataset Splits | No | The paper discusses different experimental settings (e.g., 2-actions problem, different α values, multi-agent settings) and uses Bernoulli reward distributions, but does not provide specific train/validation/test dataset splits. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU or CPU models, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper does not mention specific software dependencies or their version numbers (e.g., Python, PyTorch, TensorFlow versions) used in the experiments. |
| Experiment Setup | Yes | "The following experiments evaluate the potential of Target-UCB (C=2) in various settings. Bernoulli reward distributions are used in all experiments. Unless indicated otherwise, all results are obtained by averaging over 2000 independent runs. In all figures, shaded areas indicate one standard deviation above the mean." and "The clique size is selected using Equation (6) for δ = 0.001." |
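
The protocol quoted in the Experiment Setup row (Bernoulli rewards, averaging over independent runs, reporting a standard deviation) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain UCB1, not Target-UCB, whose index is defined in the paper's Algorithm 1 and is not reproduced here. The function names, arm means, horizon, and the placement of the constant `c` inside the square root are all hypothetical choices, not taken from the paper or its repository.

```python
import math
import random
import statistics


def run_ucb(means, horizon, c=2.0, rng=None):
    """Run one UCB1 episode on Bernoulli arms; return cumulative pseudo-regret.

    Assumption: plain UCB1 stands in for Target-UCB, and `c` is placed
    inside the square root of the exploration bonus.
    """
    rng = rng or random.Random()
    k = len(means)
    counts = [0] * k   # number of pulls per arm
    sums = [0.0] * k   # total reward per arm
    best = max(means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialise the estimates
        else:
            # Index = empirical mean + sqrt(c * ln(t) / pulls).
            arm = max(
                range(k),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(c * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli draw
        counts[arm] += 1
        sums[arm] += reward
        regret += best - means[arm]
    return regret


def evaluate(means, horizon, n_runs, seed=0):
    """Return (mean, sample std) of final regret over independent seeded runs."""
    finals = [
        run_ucb(means, horizon, rng=random.Random(seed + i))
        for i in range(n_runs)
    ]
    return statistics.mean(finals), statistics.stdev(finals)


if __name__ == "__main__":
    # Hypothetical 2-action problem; the paper averages over 2000 runs.
    mean, std = evaluate(means=[0.5, 0.6], horizon=1000, n_runs=200)
    print(f"final regret: {mean:.2f} +/- {std:.2f}")
```

Setting `n_runs=2000` would match the number of independent runs quoted above; the smaller value here only keeps the sketch quick to run.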