Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Authors: Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate T-REX on a variety of standard Atari and MuJoCo benchmark tasks. Our experiments show that T-REX can extrapolate well, achieving performance that is often more than twice as high as that of the best-performing demonstration, as well as outperforming state-of-the-art imitation learning algorithms.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, University of Texas at Austin, USA; (2) Preferred Networks, Japan.
Pseudocode | No | The paper describes the algorithm conceptually and with mathematical equations, but does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | Code available at https://github.com/hiwonjoon/ICML2019-TREX
Open Datasets | Yes | We first evaluated our proposed method on three robotic locomotion tasks using the MuJoCo simulator (Todorov et al., 2012) within OpenAI Gym (Brockman et al., 2016), namely HalfCheetah, Hopper, and Ant. [...] We next evaluated T-REX on eight Atari games shown in Table 1. [...] We used novice human demonstrations from the Atari Grand Challenge Dataset (Kurin et al., 2017) for five Atari tasks.
Dataset Splits | Yes | To generate demonstrations, we trained a Proximal Policy Optimization (PPO) [...] agent [...] For each checkpoint, we generated a trajectory of length 1,000. [...] To evaluate the effect of different levels of suboptimality, we divided the trajectories into different overlapping stages. [...] We trained the reward network using 5,000 random pairs of partial trajectories of length 50, with preference labels based on the trajectory rankings, not the ground-truth return of the partial trajectories. (See the pair-sampling sketch after the table.)
Hardware Specification | No | The paper mentions training models and using simulators (MuJoCo, OpenAI Gym) but does not specify the hardware used for these experiments, such as CPU/GPU models or memory.
Software Dependencies | Yes | We used the PPO implementation from OpenAI Baselines (Dhariwal et al., 2017) with the given default hyperparameters.
Experiment Setup | Yes | We train the reward network using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4 and a minibatch size of 64 for 10,000 timesteps. [...] We optimized the reward functions using Adam with a learning rate of 5e-5 for 30,000 steps. [...] We trained PPO on the learned reward function for 50 million frames to obtain our final policy. (See the reward-training and learned-reward-wrapper sketches after the table.)
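
The Dataset Splits row above describes how the reward-learning data is built: 5,000 random pairs of length-50 partial trajectories, labeled by the ranking of the full demonstrations rather than by the snippets' ground-truth returns. A minimal Python sketch of that sampling step follows, assuming demonstrations arrive as observation sequences already sorted from worst to best; the function name and data layout are illustrative, not taken from the released code.

```python
import random

def sample_preference_pairs(ranked_trajectories, num_pairs=5000, snippet_len=50):
    """Sample random pairs of fixed-length partial trajectories (observation
    sequences) from demonstrations sorted worst-to-best, labeling each pair by
    which full trajectory is ranked higher, not by the snippets' true returns."""
    pairs = []
    for _ in range(num_pairs):
        # Pick two demonstrations with different ranks.
        i, j = random.sample(range(len(ranked_trajectories)), 2)
        traj_i, traj_j = ranked_trajectories[i], ranked_trajectories[j]
        # Take one random fixed-length snippet from each trajectory.
        start_i = random.randint(0, len(traj_i) - snippet_len)
        start_j = random.randint(0, len(traj_j) - snippet_len)
        snippet_i = traj_i[start_i:start_i + snippet_len]
        snippet_j = traj_j[start_j:start_j + snippet_len]
        # Label 1 means the second snippet comes from the higher-ranked trajectory.
        label = 1 if j > i else 0
        pairs.append((snippet_i, snippet_j, label))
    return pairs
```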
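
The Experiment Setup row lists the reward-model optimizer settings (Adam, learning rate 1e-4, minibatch size 64, 10,000 steps). The sketch below shows one way such pairwise preference training can look: each snippet is scored by the sum of its per-step predicted rewards, and a cross-entropy loss pushes the higher-ranked snippet toward the higher predicted return. The PyTorch network and training loop are stand-ins under those assumptions, not the authors' implementation (their Atari experiments use a convolutional architecture).

```python
import random
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Per-observation reward model; a small MLP stand-in for illustration."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):          # obs: (T, obs_dim) snippet of observations
        return self.net(obs).sum()   # predicted return = sum of per-step rewards

def train_reward_net(pairs, obs_dim, steps=10_000, lr=1e-4, batch_size=64):
    """Pairwise preference training: cross-entropy over the two snippets'
    predicted returns, favoring the snippet from the higher-ranked trajectory."""
    net = RewardNet(obs_dim)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        batch = random.sample(pairs, batch_size)
        loss = 0.0
        for snippet_i, snippet_j, label in batch:
            ret_i = net(torch.as_tensor(snippet_i, dtype=torch.float32))
            ret_j = net(torch.as_tensor(snippet_j, dtype=torch.float32))
            logits = torch.stack([ret_i, ret_j]).unsqueeze(0)   # shape (1, 2)
            loss = loss + loss_fn(logits, torch.tensor([label]))
        opt.zero_grad()
        (loss / batch_size).backward()
        opt.step()
    return net
```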
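
The Software Dependencies and Experiment Setup rows then hand the learned reward to PPO from OpenAI Baselines with default hyperparameters for 50 million frames. A common way to plug a learned reward into an off-the-shelf RL library is an environment wrapper that swaps in the predicted reward; the Gym wrapper below is a sketch of that plumbing, assumed rather than taken from the paper's setup.

```python
import gym
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's ground-truth reward with the output of a learned
    reward network, so an unmodified PPO implementation can train against it.
    Assumes the classic Gym API where step() returns (obs, reward, done, info)."""
    def __init__(self, env, reward_net):
        super().__init__(env)
        self.reward_net = reward_net

    def step(self, action):
        obs, _, done, info = self.env.step(action)   # discard the true reward
        with torch.no_grad():
            # reward_net maps a (1, obs_dim) batch of observations to a scalar.
            learned_r = self.reward_net(
                torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            ).item()
        return obs, learned_r, done, info

# Illustrative usage: wrap an environment, then train PPO on it as usual.
# env = LearnedRewardWrapper(gym.make("Hopper-v2"), reward_net)
```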