Sequential Decision Making with Expert Demonstrations under Unobserved Heterogeneity

Authors: Vahid Balazadeh, Keertana Chidambaram, Viet Nguyen, Rahul G. Krishnan, Vasilis Syrgkanis

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate that our strategy surpasses existing behaviour cloning, online, and online-offline baselines for multi-armed bandits, Markov decision processes (MDPs), and partially observable MDPs, showcasing the broad reach and utility of ExPerior in using expert demonstrations across different decision-making setups."
Researcher Affiliation | Academia | Vahid Balazadeh (University of Toronto, vahid@cs.toronto.edu); Keertana Chidambaram (Stanford University, vck@stanford.edu); Viet Nguyen (University of Toronto, viet@cs.toronto.edu); Rahul G. Krishnan (University of Toronto, rahulgk@cs.toronto.edu); Vasilis Syrgkanis (Stanford University, vsyrgk@stanford.edu)
Pseudocode | Yes | "We provide a pseudo-algorithm for ExPerior in Algorithm 1." (An illustrative sketch in this spirit appears after the table.)
Open Source Code | Yes | "Our code is accessible at https://github.com/vdblm/experior"
Open Datasets | No | The paper describes experiments on synthetic and simulated environments (K-armed Bernoulli bandits, Deep Sea, Frozen Lake) but does not provide access information, links, or citations for publicly available training datasets.
Dataset Splits | No | The paper mentions training and evaluating models but never states a validation set or explicit train/validation/test splits; it reports only training steps and evaluation metrics on test scenarios.
Hardware Specification | Yes | "We have used 110 GPU-hours for all the experiments on Quadro RTX 6000."
Software Dependencies | No | The paper notes that code is provided for reproducibility but does not list software dependencies with version numbers in the main text or appendix, so the environment cannot be replicated without inspecting the code itself.
Experiment Setup | Yes | "Experiments. We consider K-armed Bernoulli bandits for our experimental setup. We evaluate the learning algorithms in terms of the Bayesian regret over multiple (prior) distributions µ over the unobserved contexts. In particular, we consider up to Nµ = 64 different beta distributions, where their parameters are chosen to span a different range of heterogeneity, consisting of tasks with various expert data informativeness. To estimate the Bayesian regret, we sample Ntask = 128 bandit tasks from each prior distribution and calculate the average regret. We use NE = 1000 expert demonstrations for each prior distribution in our experiments. ... Figure 2 demonstrates the average Bayesian regret for various prior distributions over T = 1,500 episodes with K = 10 arms. ... We conduct trials with ExPerior and Oracle-TS across various numbers of arms over T = 1,500 episodes... We run all the baselines for 90,000 steps with 30 different seeds." (See the evaluation sketch after the table.)
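
The Pseudocode row above points to Algorithm 1 of the paper, which is not reproduced here. As a rough illustration of the general idea, expert demonstrations shaping the prior of a posterior-sampling learner, here is a minimal Python sketch for a K-armed Bernoulli bandit. The `expert_informed_prior` function and its `strength` parameter are hypothetical stand-ins for however ExPerior actually maps expert data to a prior, not the paper's construction.

```python
# Illustrative sketch (NOT the paper's Algorithm 1): Thompson sampling on a
# K-armed Bernoulli bandit whose Beta prior is tilted by expert demonstrations.
# `expert_informed_prior` and `strength` are hypothetical stand-ins.
import numpy as np

def expert_informed_prior(expert_actions, K, strength=10.0):
    """Turn expert action frequencies into Beta pseudo-counts (illustrative only)."""
    counts = np.bincount(np.asarray(expert_actions), minlength=K).astype(float)
    freqs = counts / max(counts.sum(), 1.0)
    return 1.0 + strength * freqs, np.ones(K)  # (alpha, beta) per arm

def thompson_sampling(arm_means, alpha, beta, T, rng):
    """Run posterior sampling for T episodes; return cumulative regret."""
    best = arm_means.max()
    regret = 0.0
    for _ in range(T):
        theta = rng.beta(alpha, beta)          # one posterior sample per arm
        a = int(theta.argmax())                # act greedily w.r.t. the sample
        r = rng.binomial(1, arm_means[a])      # Bernoulli reward
        alpha[a] += r
        beta[a] += 1 - r                       # conjugate Beta-Bernoulli update
        regret += best - arm_means[a]
    return regret

rng = np.random.default_rng(0)
demo_actions = [3, 3, 7, 3, 3]                 # hypothetical expert picks
alpha, beta = expert_informed_prior(demo_actions, K=10)
arm_means = rng.beta(2.0, 2.0, size=10)        # one sampled bandit task
print(thompson_sampling(arm_means, alpha, beta, T=1500, rng=rng))
```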
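
The Experiment Setup row describes estimating Bayesian regret by averaging per-task regret over Ntask = 128 bandit tasks drawn from each beta prior, with K = 10 arms and T = 1,500 episodes. A minimal sketch of that evaluation loop follows, reusing `thompson_sampling` from the sketch above; it shows a single prior (the paper sweeps up to 64) and a uniform Beta(1,1) prior in place of an expert-informed one.

```python
# Sketch of the quoted evaluation protocol: sample Ntask bandit tasks from one
# Beta prior over arm means and average the per-task regret. Assumes
# `thompson_sampling` from the previous sketch is in scope.
import numpy as np

def estimate_bayesian_regret(prior_a, prior_b, K=10, T=1500, n_tasks=128, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_tasks):
        arm_means = rng.beta(prior_a, prior_b, size=K)  # one task = one draw of means
        alpha, beta = np.ones(K), np.ones(K)            # uniform prior here; swap in
                                                        # an expert-informed one
        total += thompson_sampling(arm_means, alpha, beta, T, rng)
    return total / n_tasks

# Example: a symmetric prior, i.e. one point on the heterogeneity sweep.
print(estimate_bayesian_regret(prior_a=2.0, prior_b=2.0))
```

Averaging over tasks sampled from the prior is what makes the estimate Bayesian regret rather than worst-case regret: each draw of `arm_means` plays the role of one unobserved context.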