Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Sequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity

Authors: Vahid Balazadeh, Keertana Chidambaram, Viet Nguyen, Rahul G. Krishnan, Vasilis Syrgkanis

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that our strategy surpasses existing behaviour cloning, online, and online-offline baselines for multi-armed bandits, Markov decision processes (MDPs), and partially observable MDPs, showcasing the broad reach and utility of ExPerior in using expert demonstrations across different decision-making setups."
Researcher Affiliation | Academia | Vahid Balazadeh (University of Toronto), Keertana Chidambaram (Stanford University), Viet Nguyen (University of Toronto), Rahul G. Krishnan (University of Toronto), Vasilis Syrgkanis (Stanford University)
Pseudocode | Yes | "We provide a pseudo-algorithm for ExPerior in Algorithm 1."
Open Source Code | Yes | "Our code is accessible at https://github.com/vdblm/experior"
Open Datasets | No | The paper describes experiments on synthetic and simulated environments (K-armed Bernoulli bandits, Deep Sea, Frozen Lake) but provides no access information, links, or citations for publicly available training datasets.
Dataset Splits | No | The paper mentions training and evaluating models but does not explicitly state the use of a validation set or specific train/validation/test splits; it focuses on training steps and evaluation metrics on test scenarios.
Hardware Specification | Yes | "We have used 110 GPU-hours for all the experiments on Quadro RTX 6000."
Software Dependencies | No | The paper states that code is provided for reproducibility but does not list software dependencies with version numbers in the main text or appendix, so the environment cannot be replicated without inspecting the code itself.
Experiment Setup | Yes | "Experiments. We consider K-armed Bernoulli bandits for our experimental setup. We evaluate the learning algorithms in terms of the Bayesian regret over multiple (prior) distributions µ over the unobserved contexts. In particular, we consider up to N_µ = 64 different beta distributions, where their parameters are chosen to span a different range of heterogeneity, consisting of tasks with various expert data informativeness. To estimate the Bayesian regret, we sample N_task = 128 bandit tasks from each prior distribution and calculate the average regret. We use N_E = 1000 expert demonstrations for each prior distribution in our experiments. ... Figure 2 demonstrates the average Bayesian regret for various prior distributions over T = 1,500 episodes with K = 10 arms. ... We conduct trials with ExPerior and Oracle-TS across various numbers of arms over T = 1,500 episodes ... We run all the baselines for 90,000 steps with 30 different seeds."
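The evaluation protocol quoted above (sample bandit tasks from a beta prior, run a learner on each, average the per-task regret) can be sketched as follows. This is a minimal illustrative sketch, not the paper's method: it uses plain Thompson sampling as a stand-in learner rather than ExPerior, and the function name and parameters (`bayesian_regret_estimate`, `alpha`, `beta`, `n_tasks`) are assumptions for illustration only.

```python
import numpy as np

def bayesian_regret_estimate(alpha, beta, K=10, T=1500, n_tasks=128, seed=0):
    """Monte Carlo estimate of Bayesian regret for a K-armed Bernoulli
    bandit: draw task mean vectors from a Beta(alpha, beta) prior, run a
    learner on each task (here plain Thompson sampling, as an
    illustrative stand-in for ExPerior), and average per-task regret."""
    rng = np.random.default_rng(seed)
    regrets = []
    for _ in range(n_tasks):
        mu = rng.beta(alpha, beta, size=K)   # unobserved task context
        best = mu.max()                      # optimal arm's mean reward
        a_post = np.ones(K)                  # Beta(1, 1) posterior per arm:
        b_post = np.ones(K)                  # success / failure counts
        regret = 0.0
        for _ in range(T):
            # Thompson sampling: pull the arm with the largest posterior draw
            arm = int(np.argmax(rng.beta(a_post, b_post)))
            reward = rng.binomial(1, mu[arm])
            a_post[arm] += reward
            b_post[arm] += 1 - reward
            regret += best - mu[arm]         # expected per-step regret
        regrets.append(regret)
    return float(np.mean(regrets))
```

With the paper's reported settings this would be called as `bayesian_regret_estimate(alpha, beta, K=10, T=1500, n_tasks=128)` for each of the beta priors; the choice of Thompson sampling here is only to make the regret-averaging loop concrete.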