Adversarial Imitation via Variational Inverse Reinforcement Learning

Authors: Ahmed H. Qureshi, Byron Boots, Michael C. Yip

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our approach on various high-dimensional complex control tasks. We also test our learned rewards in challenging transfer learning problems where training and testing environments are made to be different from each other in terms of dynamics or structure. The results show that our proposed method not only learns near-optimal rewards and policies that are matching expert behavior but also performs significantly better than state-of-the-art inverse reinforcement learning algorithms.
Researcher Affiliation | Academia | Ahmed H. Qureshi, Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA (a1qureshi@ucsd.edu); Byron Boots, College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA (bboots@cc.gatech.edu); Michael C. Yip, Department of Electrical and Computer Engineering, University of California San Diego, La Jolla, CA 92093, USA (yip@ucsd.edu)
Pseudocode | Yes | Algorithm 1: Empowerment-based Adversarial Inverse Reinforcement Learning
Open Source Code | Yes | Supplementary material is available at https://sites.google.com/view/eairl
Open Datasets | Yes | We evaluate our method against both state-of-the-art policy and reward learning techniques on several control tasks in OpenAI Gym. ... For each algorithm, we provided 20 expert demonstrations generated by a policy trained on a ground-truth reward using TRPO (Schulman et al., 2015). (A rollout sketch of this demonstration-collection step appears after the table.)
Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (percentages, absolute counts, or references to predefined splits) for reproducibility.
Hardware Specification | No | The paper does not explicitly describe the specific hardware used (e.g., GPU models, CPU types, or cloud instance specifications) for running the experiments.
Software Dependencies | No | The paper mentions software such as TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017), but does not provide specific version numbers for these or any other ancillary software components.
Experiment Setup | Yes | For all experiments, we use the temperature term β = 1. We evaluated both mean-squared and absolute error forms of l_I(s, a, s′) and found that both lead to similar performance in reward and policy learning. We set the entropy regularization weight to 0.1 and 0.001 for reward and policy learning, respectively. The hyperparameter λ_I was set to 1.0 for reward learning and 0.001 for policy learning. The target parameters of the empowerment-based potential function Φ_ϕ(·) were updated every 5 and 2 epochs during reward and policy learning, respectively. Furthermore, we set the batch size to 2000 and 20000 steps per TRPO update for the pendulum and the remaining environments, respectively. For the methods (Fu et al., 2017; Ho & Ermon, 2016) presented for comparison, we use their suggested hyperparameters. We also use policy samples from the previous 20 iterations as negative data to train the discriminator of all IRL methods presented in this paper, to prevent the parametrized reward functions from overfitting the current policy samples. (A hedged configuration sketch and a negative-sample buffer sketch appear after the table.)
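
The Open Datasets row notes that the 20 expert demonstrations were generated by rolling out a policy trained with TRPO on the ground-truth reward in OpenAI Gym. A minimal rollout sketch is given below; the `collect_demonstrations` helper, the `Pendulum-v0` environment choice, and the classic Gym step API (4-tuple return) are assumptions for illustration, not the authors' code.

```python
import gym


def collect_demonstrations(env, policy, n_episodes=20):
    """Roll out a trained policy and record (state, action) pairs per episode.

    `policy` is any callable mapping an observation to an action, e.g. a
    TRPO-trained expert; the random stand-in below is only a placeholder.
    """
    demos = []
    for _ in range(n_episodes):
        obs = env.reset()
        trajectory = []
        done = False
        while not done:
            action = policy(obs)
            next_obs, reward, done, _ = env.step(action)  # classic Gym 4-tuple API assumed
            trajectory.append((obs, action))
            obs = next_obs
        demos.append(trajectory)
    return demos


if __name__ == "__main__":
    env = gym.make("Pendulum-v0")  # one of the Gym control tasks; exact ID assumed
    placeholder_expert = lambda obs: env.action_space.sample()  # stand-in for a TRPO expert
    demos = collect_demonstrations(env, placeholder_expert, n_episodes=20)
    print(len(demos), "demonstrations collected")
```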
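
The Experiment Setup row lists the main hyperparameters. The sketch below only collects the quoted values into a single configuration, split by reward-learning and policy-learning phases as the row describes; the dictionary name and key names are assumptions for illustration, not the authors' naming.

```python
# Hedged sketch: hyperparameters quoted in the Experiment Setup row,
# grouped by learning phase. Key names are illustrative assumptions.
EAIRL_CONFIG = {
    "temperature_beta": 1.0,                 # β used in all experiments
    "reward_learning": {
        "entropy_reg_weight": 0.1,
        "lambda_I": 1.0,
        "target_update_every_epochs": 5,     # empowerment-based potential Φ_ϕ target
    },
    "policy_learning": {
        "entropy_reg_weight": 0.001,
        "lambda_I": 0.001,
        "target_update_every_epochs": 2,
    },
    "trpo_steps_per_update": {
        "pendulum": 2000,
        "other_envs": 20000,
    },
    "expert_demonstrations": 20,
    "discriminator_negative_sample_window": 20,  # previous policy iterations
}
```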
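
The last sentence of the Experiment Setup row states that policy samples from the previous 20 iterations are kept as negative data for the discriminator, to keep the learned reward from overfitting the current policy. A minimal sketch of such a bounded buffer is given below, assuming per-iteration samples are stored as NumPy arrays of transitions; the class name and interface are hypothetical, not the authors' implementation.

```python
from collections import deque

import numpy as np


class NegativeSampleBuffer:
    """Keep policy samples from the most recent `window` iterations as
    negative (non-expert) data for discriminator updates. Hypothetical helper."""

    def __init__(self, window=20):
        # Old iterations are dropped automatically once the window is full.
        self.batches = deque(maxlen=window)

    def add_iteration(self, samples):
        """`samples`: array of transitions collected in the current iteration."""
        self.batches.append(np.asarray(samples))

    def sample(self, batch_size, rng=np.random):
        """Draw a mixed batch across all stored iterations."""
        if not self.batches:
            raise ValueError("buffer is empty; add at least one iteration first")
        pool = np.concatenate(list(self.batches), axis=0)
        idx = rng.choice(len(pool), size=min(batch_size, len(pool)), replace=False)
        return pool[idx]
```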