Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations

Authors: Daniel Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate T-REX on a variety of standard Atari and MuJoCo benchmark tasks. Our experiments show that T-REX can extrapolate well, achieving performance that is often more than twice as high as that of the best-performing demonstration, as well as outperforming state-of-the-art imitation learning algorithms.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, University of Texas at Austin, USA; (2) Preferred Networks, Japan.
Pseudocode | No | The paper describes the algorithm conceptually and with mathematical equations, but does not include a formal pseudocode block or algorithm listing.
Open Source Code | Yes | Code available at https://github.com/hiwonjoon/ICML2019-TREX
Open Datasets | Yes | We first evaluated our proposed method on three robotic locomotion tasks using the MuJoCo simulator (Todorov et al., 2012) within OpenAI Gym (Brockman et al., 2016), namely HalfCheetah, Hopper, and Ant. [...] We next evaluated T-REX on eight Atari games shown in Table 1. [...] We used novice human demonstrations from the Atari Grand Challenge Dataset (Kurin et al., 2017) for five Atari tasks.
Dataset Splits | Yes | To generate demonstrations, we trained a Proximal Policy Optimization (PPO) [...] agent [...] For each checkpoint, we generated a trajectory of length 1,000. [...] To evaluate the effect of different levels of suboptimality, we divided the trajectories into different overlapping stages. [...] We trained the reward network using 5,000 random pairs of partial trajectories of length 50, with preference labels based on the trajectory rankings, not the ground-truth return of the partial trajectories. (See the pair-sampling sketch after the table.)
Hardware Specification | No | The paper mentions training models and using simulators (MuJoCo, OpenAI Gym) but does not specify the hardware used for these experiments, such as CPU/GPU models or memory.
Software Dependencies | Yes | We used the PPO implementation from OpenAI Baselines (Dhariwal et al., 2017) with the given default hyperparameters.
Experiment Setup | Yes | We train the reward network using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 1e-4 and a minibatch size of 64 for 10,000 timesteps. [...] We optimized the reward functions using Adam with a learning rate of 5e-5 for 30,000 steps. [...] We trained PPO on the learned reward function for 50 million frames to obtain our final policy. (See the reward-training and learned-reward-wrapper sketches after the table.)
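
The Dataset Splits row above describes how the reward-learning data is built: 5,000 random pairs of length-50 partial trajectories, labeled by the ranking of the full demonstrations rather than by the snippets' ground-truth returns. A minimal Python sketch of that sampling step follows, assuming demonstrations arrive as observation sequences already sorted from worst to best; the function name and data layout are illustrative, not taken from the released code.

```python
import random

def sample_preference_pairs(ranked_trajectories, num_pairs=5000, snippet_len=50):
    """Sample random pairs of fixed-length partial trajectories (observation
    sequences) from demonstrations sorted worst-to-best, labeling each pair by
    which full trajectory is ranked higher, not by the snippets' true returns."""
    pairs = []
    for _ in range(num_pairs):
        # Pick two demonstrations with different ranks.
        i, j = random.sample(range(len(ranked_trajectories)), 2)
        traj_i, traj_j = ranked_trajectories[i], ranked_trajectories[j]
        # Take one random fixed-length snippet from each trajectory.
        start_i = random.randint(0, len(traj_i) - snippet_len)
        start_j = random.randint(0, len(traj_j) - snippet_len)
        snippet_i = traj_i[start_i:start_i + snippet_len]
        snippet_j = traj_j[start_j:start_j + snippet_len]
        # Label 1 means the second snippet comes from the higher-ranked trajectory.
        label = 1 if j > i else 0
        pairs.append((snippet_i, snippet_j, label))
    return pairs
```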
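
The Experiment Setup row lists the reward-model optimizer settings (Adam, learning rate 1e-4, minibatch size 64, 10,000 steps). The sketch below shows one way such pairwise preference training can look: each snippet is scored by the sum of its per-step predicted rewards, and a cross-entropy loss pushes the higher-ranked snippet toward the higher predicted return. The PyTorch network and training loop are stand-ins under those assumptions, not the authors' implementation (their Atari experiments use a convolutional architecture).

```python
import random
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Per-observation reward model; a small MLP stand-in for illustration."""
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):          # obs: (T, obs_dim) snippet of observations
        return self.net(obs).sum()   # predicted return = sum of per-step rewards

def train_reward_net(pairs, obs_dim, steps=10_000, lr=1e-4, batch_size=64):
    """Pairwise preference training: cross-entropy over the two snippets'
    predicted returns, favoring the snippet from the higher-ranked trajectory."""
    net = RewardNet(obs_dim)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        batch = random.sample(pairs, batch_size)
        loss = 0.0
        for snippet_i, snippet_j, label in batch:
            ret_i = net(torch.as_tensor(snippet_i, dtype=torch.float32))
            ret_j = net(torch.as_tensor(snippet_j, dtype=torch.float32))
            logits = torch.stack([ret_i, ret_j]).unsqueeze(0)   # shape (1, 2)
            loss = loss + loss_fn(logits, torch.tensor([label]))
        opt.zero_grad()
        (loss / batch_size).backward()
        opt.step()
    return net
```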
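
The Software Dependencies and Experiment Setup rows then hand the learned reward to PPO from OpenAI Baselines with default hyperparameters for 50 million frames. A common way to plug a learned reward into an off-the-shelf RL library is an environment wrapper that swaps in the predicted reward; the Gym wrapper below is a sketch of that plumbing, assumed rather than taken from the paper's setup.

```python
import gym
import torch

class LearnedRewardWrapper(gym.Wrapper):
    """Replace the environment's ground-truth reward with the output of a learned
    reward network, so an unmodified PPO implementation can train against it.
    Assumes the classic Gym API where step() returns (obs, reward, done, info)."""
    def __init__(self, env, reward_net):
        super().__init__(env)
        self.reward_net = reward_net

    def step(self, action):
        obs, _, done, info = self.env.step(action)   # discard the true reward
        with torch.no_grad():
            # reward_net maps a (1, obs_dim) batch of observations to a scalar.
            learned_r = self.reward_net(
                torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            ).item()
        return obs, learned_r, done, info

# Illustrative usage: wrap an environment, then train PPO on it as usual.
# env = LearnedRewardWrapper(gym.make("Hopper-v2"), reward_net)
```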