Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Authors: Daniel Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on imitation learning for Atari games and demonstrate that we can efficiently compute high-confidence bounds on policy performance
Researcher Affiliation | Academia | 1 Computer Science Department, The University of Texas at Austin; 2 Applied Research Laboratories, The University of Texas at Austin.
Pseudocode | Yes | see the Appendix for full implementation details and pseudo-code
Open Source Code | Yes | Project page, code, and demonstration data are available at https://sites.google.com/view/bayesianrex/
Open Datasets | Yes | We selected five Atari games from the Arcade Learning Environment (Bellemare et al., 2013).
Dataset Splits | No | The paper mentions using 12 suboptimal demonstrations and later adding 2 more, but it does not specify a training/validation/test split; instead, it uses the demonstrations to learn reward functions and then evaluates the resulting policies.
Hardware Specification | Yes | Running MCMC with 66 preference labels to generate 100,000 reward hypotheses for Atari imitation learning tasks takes approximately 5 minutes on a Dell Inspiron 5577 personal laptop with an Intel i7-7700 processor without using the GPU. In comparison, using standard Bayesian IRL to generate one sample from the posterior takes 10+ hours of training for a parallelized PPO reinforcement learning agent (Dhariwal et al., 2017) on an NVIDIA TITAN V GPU.
Software Dependencies | No | The paper mentions using 'Proximal Policy Optimization (PPO)' and refers to 'OpenAI Baselines' but does not specify version numbers for these or any other software libraries or dependencies.
Experiment Setup | Yes | We pre-trained a 64-dimensional latent state embedding φ(s) using the self-supervised losses shown in Table 1 and the T-REX pairwise preference loss. To optimize a control policy, we used Proximal Policy Optimization (PPO) (Schulman et al., 2017) with the MAP and mean reward functions from the posterior. We ran Bayesian REX to generate 200,000 samples from P(R | D, P). To address some of the ill-posedness of IRL, we normalize the weights w such that ||w||_2 = 1.
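The Hardware Specification and Experiment Setup rows describe the core inference step of Bayesian REX: Metropolis-Hastings MCMC over unit-norm linear reward weights w on a fixed pre-trained embedding φ(s), with a T-REX-style pairwise preference (Bradley-Terry) likelihood. The sketch below illustrates that step under our own simplifying assumptions; the function names, the proposal step size, and the treatment of the normalized proposal as symmetric are ours and are not taken from the paper's released code.

```python
import numpy as np

def preference_log_likelihood(w, traj_feats, prefs, beta=1.0):
    """Pairwise preference (Bradley-Terry) log-likelihood, T-REX style.

    traj_feats: (num_trajs, d) cumulative latent features Phi(tau) of each demonstration.
    prefs: list of (i, j) pairs meaning trajectory j is preferred over trajectory i.
    """
    returns = traj_feats @ w  # predicted return of each trajectory under w
    ll = 0.0
    for i, j in prefs:
        # log P(tau_i < tau_j | w) = beta * R_j - logsumexp(beta * R_i, beta * R_j)
        ll += beta * returns[j] - np.logaddexp(beta * returns[i], beta * returns[j])
    return ll

def mcmc_reward_samples(traj_feats, prefs, num_samples=200_000, step=0.05, seed=0):
    """Metropolis-Hastings over reward weights constrained to ||w||_2 = 1."""
    rng = np.random.default_rng(seed)
    d = traj_feats.shape[1]
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    ll = preference_log_likelihood(w, traj_feats, prefs)
    samples = []
    for _ in range(num_samples):  # 200,000 mirrors the sample count quoted above
        w_prop = w + step * rng.normal(size=d)
        w_prop /= np.linalg.norm(w_prop)  # keep proposals on the unit sphere
        ll_prop = preference_log_likelihood(w_prop, traj_feats, prefs)
        # Simplification: treat the normalized Gaussian proposal as symmetric,
        # so the acceptance ratio reduces to the likelihood ratio.
        if np.log(rng.uniform()) < ll_prop - ll:
            w, ll = w_prop, ll_prop
        samples.append(w.copy())
    return np.array(samples)
```

Because the embedding φ(s) is frozen and each trajectory is summarized by a single d-dimensional feature vector, every MCMC step costs only a handful of dot products, which is consistent with the laptop-scale runtime quoted in the Hardware Specification row.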
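The Research Type row quotes the paper's claim that the posterior enables high-confidence bounds on policy performance. In Bayesian REX this comes down to scoring the evaluation policy's expected cumulative feature counts under each posterior reward sample and reporting a low quantile of the resulting returns. The sketch below assumes the w_samples array produced above and a hypothetical policy_feature_counts vector estimated from rollouts of the evaluation policy; both names are ours.

```python
import numpy as np

def high_confidence_lower_bound(w_samples, policy_feature_counts, delta=0.05,
                                burn_in=0, skip=1):
    """delta-quantile lower bound on policy return over the reward posterior.

    w_samples: (N, d) MCMC samples of reward weights.
    policy_feature_counts: (d,) expected cumulative embedding of the evaluation
        policy, estimated from rollouts of that policy.
    """
    chain = w_samples[burn_in::skip]            # optionally discard burn-in / thin the chain
    returns = chain @ policy_feature_counts     # policy return under each sampled reward
    return np.quantile(returns, delta)          # e.g. delta=0.05 -> 95% confidence lower bound
```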