Adversarial Imitation via Variational Inverse Reinforcement Learning

Authors: Ahmed H. Qureshi, Byron Boots, Michael C. Yip

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on various high-dimensional complex control tasks. We also test our learned rewards in challenging transfer learning problems where training and testing environments are made to be different from each other in terms of dynamics or structure. The results show that our proposed method not only learns near-optimal rewards and policies that are matching expert behavior but also performs significantly better than state-of-the-art inverse reinforcement learning algorithms.
Researcher Affiliation | Academia | Ahmed H. Qureshi, Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA (a1qureshi@ucsd.edu); Byron Boots, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA (bboots@cc.gatech.edu); Michael C. Yip, Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA (yip@ucsd.edu)
Pseudocode | Yes | Algorithm 1: Empowerment-based Adversarial Inverse Reinforcement Learning
Open Source Code | Yes | Supplementary material is available at https://sites.google.com/view/eairl
Open Datasets | Yes | We evaluate our method against both state-of-the-art policy and reward learning techniques on several control tasks in OpenAI Gym. ... For each algorithm, we provided 20 expert demonstrations generated by a policy trained on a ground-truth reward using TRPO (Schulman et al., 2015). (A rollout sketch of this demonstration-collection step appears after the table.)
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (percentages, absolute counts, or references to predefined splits) for reproducibility.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, or cloud instance specifications) for running the experiments.
Software Dependencies | No | The paper mentions software such as TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017), but does not provide specific version numbers for these or any other ancillary software components.
Experiment Setup | Yes | For all experiments, we use the temperature term β = 1. We evaluated both mean-squared and absolute error forms of l_I(s, a, s′) and found that both lead to similar performance in reward and policy learning. We set the entropy regularization weight to 0.1 and 0.001 for reward and policy learning, respectively. The hyperparameter λ_I was set to 1.0 for reward learning and 0.001 for policy learning. The target parameters of the empowerment-based potential function Φ_ϕ(·) were updated every 5 and 2 epochs during reward and policy learning, respectively. Furthermore, we set the batch size to 2000 and 20000 steps per TRPO update for the pendulum and the remaining environments, respectively. For the methods (Fu et al., 2017; Ho & Ermon, 2016) presented for comparison, we use their suggested hyperparameters. We also use policy samples from the previous 20 iterations as negative data to train the discriminator of all IRL methods presented in this paper, to prevent the parametrized reward functions from overfitting the current policy samples. (A hedged configuration sketch and a negative-sample buffer sketch appear after the table.)
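
The Open Datasets row notes that the 20 expert demonstrations were generated by rolling out a policy trained with TRPO on the ground-truth reward in OpenAI Gym. A minimal rollout sketch is given below; the `collect_demonstrations` helper, the `Pendulum-v0` environment choice, and the classic Gym step API (4-tuple return) are assumptions for illustration, not the authors' code.

```python
import gym


def collect_demonstrations(env, policy, n_episodes=20):
    """Roll out a trained policy and record (state, action) pairs per episode.

    `policy` is any callable mapping an observation to an action, e.g. a
    TRPO-trained expert; the random stand-in below is only a placeholder.
    """
    demos = []
    for _ in range(n_episodes):
        obs = env.reset()
        trajectory = []
        done = False
        while not done:
            action = policy(obs)
            next_obs, reward, done, _ = env.step(action)  # classic Gym 4-tuple API assumed
            trajectory.append((obs, action))
            obs = next_obs
        demos.append(trajectory)
    return demos


if __name__ == "__main__":
    env = gym.make("Pendulum-v0")  # one of the Gym control tasks; exact ID assumed
    placeholder_expert = lambda obs: env.action_space.sample()  # stand-in for a TRPO expert
    demos = collect_demonstrations(env, placeholder_expert, n_episodes=20)
    print(len(demos), "demonstrations collected")
```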
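
The Experiment Setup row lists the main hyperparameters. The sketch below only collects the quoted values into a single configuration, split by reward-learning and policy-learning phases as the row describes; the dictionary name and key names are assumptions for illustration, not the authors' naming.

```python
# Hedged sketch: hyperparameters quoted in the Experiment Setup row,
# grouped by learning phase. Key names are illustrative assumptions.
EAIRL_CONFIG = {
    "temperature_beta": 1.0,                 # β used in all experiments
    "reward_learning": {
        "entropy_reg_weight": 0.1,
        "lambda_I": 1.0,
        "target_update_every_epochs": 5,     # empowerment-based potential Φ_ϕ target
    },
    "policy_learning": {
        "entropy_reg_weight": 0.001,
        "lambda_I": 0.001,
        "target_update_every_epochs": 2,
    },
    "trpo_steps_per_update": {
        "pendulum": 2000,
        "other_envs": 20000,
    },
    "expert_demonstrations": 20,
    "discriminator_negative_sample_window": 20,  # previous policy iterations
}
```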
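
The last sentence of the Experiment Setup row states that policy samples from the previous 20 iterations are kept as negative data for the discriminator, to keep the learned reward from overfitting the current policy. A minimal sketch of such a bounded buffer is given below, assuming per-iteration samples are stored as NumPy arrays of transitions; the class name and interface are hypothetical, not the authors' implementation.

```python
from collections import deque

import numpy as np


class NegativeSampleBuffer:
    """Keep policy samples from the most recent `window` iterations as
    negative (non-expert) data for discriminator updates. Hypothetical helper."""

    def __init__(self, window=20):
        # Old iterations are dropped automatically once the window is full.
        self.batches = deque(maxlen=window)

    def add_iteration(self, samples):
        """`samples`: array of transitions collected in the current iteration."""
        self.batches.append(np.asarray(samples))

    def sample(self, batch_size, rng=np.random):
        """Draw a mixed batch across all stored iterations."""
        if not self.batches:
            raise ValueError("buffer is empty; add at least one iteration first")
        pool = np.concatenate(list(self.batches), axis=0)
        idx = rng.choice(len(pool), size=min(batch_size, len(pool)), replace=False)
        return pool[idx]
```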