LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning

Authors: Firas Al-Hafez, Davide Tateo, Oleg Arenz, Guoping Zhao, Jan Peters

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate our method on six MuJoCo environments: Ant-v3, Walker2d-v3, Hopper-v3, HalfCheetah-v3, Humanoid-v3, and Atlas. The latter is a novel locomotion environment introduced by us and is further described in Appendix C.1. We select the following baselines: GAIL (Ho & Ermon, 2016), VAIL (Peng et al., 2019), IQ-Learn (Garg et al., 2021) and SQIL (Reddy et al., 2020)." (an environment sketch follows the table)
Researcher Affiliation | Academia | "Firas Al-Hafez (1), Davide Tateo (1), Oleg Arenz (1), Guoping Zhao (2), Jan Peters (1,3); (1) Intelligent Autonomous Systems, (2) Locomotion Laboratory, (3) German Research Center for AI (DFKI), Centre for Cognitive Science, Hessian.AI; TU Darmstadt, Germany; {name.surname}@tu-darmstadt.de"
Pseudocode | Yes | "Algorithm 1 LS-IQ" (an illustrative loss sketch follows the table)
Open Source Code | Yes | "The code is available at https://github.com/robfiras/ls-iq"
Open Datasets | Yes | "We evaluate our method on six MuJoCo environments: Ant-v3, Walker2d-v3, Hopper-v3, HalfCheetah-v3, Humanoid-v3, and Atlas. ... The code for the environment as well as the expert data is available at https://github.com/robfiras/ls-iq."
Dataset Splits | No | The paper states "We use ten seeds and five expert trajectories for these experiments." and mentions per-environment hyperparameter tuning ("we tune on each environment"), but it does not specify explicit validation splits (e.g., percentages or counts) for hyperparameter selection or model evaluation in the main text.
Hardware Specification | Yes | "Calculations for this research were conducted on the Lichtenberg high-performance computer of the TU Darmstadt."
Software Dependencies | No | The paper names its framework but lists no versioned dependencies: "For a fair comparison, all methods are implemented in the same framework, MushroomRL (D'Eramo et al., 2021). We verify that our implementations achieve comparable results to the original implementations by the authors."
Experiment Setup | Yes | "We use the hyperparameters proposed by the original authors for the respective environments and perform a grid search on novel environments. ... For our method, we use the same hyperparameters as IQ-Learn, except for the regularizer coefficient c and the entropy coefficient β, which we tune on each environment. We only consider equal mixing, i.e., α = 0.5. ... We use ten seeds and five expert trajectories for these experiments." (a grid-search sketch follows the table)
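
The five standard tasks listed in the Research Type row are stock Gym MuJoCo environments and can be instantiated directly; Atlas is the authors' own environment and ships with their repository. A minimal sketch, assuming an OpenAI Gym installation with MuJoCo support:

    import gym

    # The five standard MuJoCo locomotion tasks used in the paper's evaluation.
    STANDARD_ENVS = ["Ant-v3", "Walker2d-v3", "Hopper-v3",
                     "HalfCheetah-v3", "Humanoid-v3"]
    envs = {name: gym.make(name) for name in STANDARD_ENVS}

    # Atlas is not part of stock Gym: it is distributed with the authors'
    # code (https://github.com/robfiras/ls-iq) together with the expert data.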
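
The Pseudocode row refers to Algorithm 1 (LS-IQ), which is not reproduced here. As a rough illustration of the ingredients named in the Experiment Setup row (a regularizer coefficient c and equal mixing with α = 0.5), below is a sketch of an IQ-Learn-style critic loss with a squared regularizer on the implicit reward r(s, a) = Q(s, a) − γV(s'), averaged over an equal mixture of expert and policy samples. The paper's exact objective, targets, and absorbing-state handling differ; every name here is an assumption, not the authors' implementation:

    import torch

    def implicit_reward(q_net, v_net, batch, gamma=0.99) -> torch.Tensor:
        # Implicit reward recovered from the critic: r = Q(s, a) - gamma * V(s').
        s, a, s_next = batch["state"], batch["action"], batch["next_state"]
        return q_net(s, a) - gamma * v_net(s_next)

    def ls_iq_style_loss(q_net, v_net, expert_batch, policy_batch,
                         gamma=0.99, c=0.5, alpha=0.5) -> torch.Tensor:
        # Illustrative only: maximize the implicit reward on expert data while
        # penalizing squared implicit rewards on an equal (alpha = 0.5) mixture
        # of expert and policy samples, loosely following the LS-IQ idea.
        r_exp = implicit_reward(q_net, v_net, expert_batch, gamma)
        r_pol = implicit_reward(q_net, v_net, policy_batch, gamma)
        reward_term = -r_exp.mean()
        reg_term = c * (alpha * (r_exp ** 2).mean()
                        + (1.0 - alpha) * (r_pol ** 2).mean())
        return reward_term + reg_term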
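
The Experiment Setup row states that c and β are tuned per environment via grid search, with ten seeds and five expert trajectories per run. A minimal sketch of how such a sweep could be organized; the grid values and run_experiment are hypothetical placeholders, since the paper does not list the searched values:

    from itertools import product

    C_VALUES = [0.25, 0.5, 1.0]      # hypothetical grid for the regularizer coefficient c
    BETA_VALUES = [0.01, 0.05, 0.1]  # hypothetical grid for the entropy coefficient beta
    SEEDS = range(10)                # "We use ten seeds ..."
    N_EXPERT_TRAJECTORIES = 5        # "... and five expert trajectories"

    def run_experiment(env_name, c, beta, seed, n_trajectories):
        # Placeholder: a real run would train LS-IQ with this configuration
        # and return an evaluation metric such as average return.
        print(f"env={env_name} c={c} beta={beta} seed={seed} trajs={n_trajectories}")

    for c, beta, seed in product(C_VALUES, BETA_VALUES, SEEDS):
        run_experiment("Atlas", c=c, beta=beta, seed=seed,
                       n_trajectories=N_EXPERT_TRAJECTORIES)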