Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences

Authors: Daniel Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our method on imitation learning for Atari games and demonstrate that we can efficiently compute high-confidence bounds on policy performance
Researcher Affiliation | Academia | 1 Computer Science Department, The University of Texas at Austin; 2 Applied Research Laboratories, The University of Texas at Austin.
Pseudocode | Yes | see the Appendix for full implementation details and pseudo-code
Open Source Code | Yes | Project page, code, and demonstration data are available at https://sites.google.com/view/bayesianrex/
Open Datasets | Yes | We selected five Atari games from the Arcade Learning Environment (Bellemare et al., 2013).
Dataset Splits | No | The paper mentions using 12 suboptimal demonstrations and later adding 2 more, but it does not specify a training/validation/test split; instead, it uses the demonstrations to learn reward functions and then evaluates the resulting policies.
Hardware Specification | Yes | Running MCMC with 66 preference labels to generate 100,000 reward hypotheses for Atari imitation learning tasks takes approximately 5 minutes on a Dell Inspiron 5577 personal laptop with an Intel i7-7700 processor without using the GPU. In comparison, using standard Bayesian IRL to generate one sample from the posterior takes 10+ hours of training for a parallelized PPO reinforcement learning agent (Dhariwal et al., 2017) on an NVIDIA TITAN V GPU.
Software Dependencies | No | The paper mentions using 'Proximal Policy Optimization (PPO)' and refers to 'OpenAI Baselines' but does not specify version numbers for these or any other software libraries or dependencies.
Experiment Setup | Yes | We pre-trained a 64-dimensional latent state embedding φ(s) using the self-supervised losses shown in Table 1 and the T-REX pairwise preference loss. To optimize a control policy, we used Proximal Policy Optimization (PPO) (Schulman et al., 2017) with the MAP and mean reward functions from the posterior. We ran Bayesian REX to generate 200,000 samples from P(R | D, P). To address some of the ill-posedness of IRL, we normalize the weights w such that ||w||_2 = 1.
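The Hardware Specification and Experiment Setup rows describe the core inference step of Bayesian REX: Metropolis-Hastings MCMC over unit-norm linear reward weights w on a fixed pre-trained embedding φ(s), with a T-REX-style pairwise preference (Bradley-Terry) likelihood. The sketch below illustrates that step under our own simplifying assumptions; the function names, the proposal step size, and the treatment of the normalized proposal as symmetric are ours and are not taken from the paper's released code.

```python
import numpy as np

def preference_log_likelihood(w, traj_feats, prefs, beta=1.0):
    """Pairwise preference (Bradley-Terry) log-likelihood, T-REX style.

    traj_feats: (num_trajs, d) cumulative latent features Phi(tau) of each demonstration.
    prefs: list of (i, j) pairs meaning trajectory j is preferred over trajectory i.
    """
    returns = traj_feats @ w  # predicted return of each trajectory under w
    ll = 0.0
    for i, j in prefs:
        # log P(tau_i < tau_j | w) = beta * R_j - logsumexp(beta * R_i, beta * R_j)
        ll += beta * returns[j] - np.logaddexp(beta * returns[i], beta * returns[j])
    return ll

def mcmc_reward_samples(traj_feats, prefs, num_samples=200_000, step=0.05, seed=0):
    """Metropolis-Hastings over reward weights constrained to ||w||_2 = 1."""
    rng = np.random.default_rng(seed)
    d = traj_feats.shape[1]
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    ll = preference_log_likelihood(w, traj_feats, prefs)
    samples = []
    for _ in range(num_samples):  # 200,000 mirrors the sample count quoted above
        w_prop = w + step * rng.normal(size=d)
        w_prop /= np.linalg.norm(w_prop)  # keep proposals on the unit sphere
        ll_prop = preference_log_likelihood(w_prop, traj_feats, prefs)
        # Simplification: treat the normalized Gaussian proposal as symmetric,
        # so the acceptance ratio reduces to the likelihood ratio.
        if np.log(rng.uniform()) < ll_prop - ll:
            w, ll = w_prop, ll_prop
        samples.append(w.copy())
    return np.array(samples)
```

Because the embedding φ(s) is frozen and each trajectory is summarized by a single d-dimensional feature vector, every MCMC step costs only a handful of dot products, which is consistent with the laptop-scale runtime quoted in the Hardware Specification row.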
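The Research Type row quotes the paper's claim that the posterior enables high-confidence bounds on policy performance. In Bayesian REX this comes down to scoring the evaluation policy's expected cumulative feature counts under each posterior reward sample and reporting a low quantile of the resulting returns. The sketch below assumes the w_samples array produced above and a hypothetical policy_feature_counts vector estimated from rollouts of the evaluation policy; both names are ours.

```python
import numpy as np

def high_confidence_lower_bound(w_samples, policy_feature_counts, delta=0.05,
                                burn_in=0, skip=1):
    """delta-quantile lower bound on policy return over the reward posterior.

    w_samples: (N, d) MCMC samples of reward weights.
    policy_feature_counts: (d,) expected cumulative embedding of the evaluation
        policy, estimated from rollouts of that policy.
    """
    chain = w_samples[burn_in::skip]            # optionally discard burn-in / thin the chain
    returns = chain @ policy_feature_counts     # policy return under each sampled reward
    return np.quantile(returns, delta)          # e.g. delta=0.05 -> 95% confidence lower bound
```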