LS-IQ: Implicit Reward Regularization for Inverse Reinforcement Learning
Authors: Firas Al-Hafez, Davide Tateo, Oleg Arenz, Guoping Zhao, Jan Peters
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our method on six MuJoCo environments: Ant-v3, Walker2d-v3, Hopper-v3, HalfCheetah-v3, Humanoid-v3, and Atlas. The latter is a novel locomotion environment introduced by us and is further described in Appendix C.1. We select the following baselines: GAIL (Ho & Ermon, 2016), VAIL (Peng et al., 2019), IQ-Learn (Garg et al., 2021) and SQIL (Reddy et al., 2020). |
| Researcher Affiliation | Academia | Firas Al-Hafez^1, Davide Tateo^1, Oleg Arenz^1, Guoping Zhao^2, Jan Peters^{1,3}; ^1 Intelligent Autonomous Systems, ^2 Locomotion Laboratory, ^3 German Research Center for AI (DFKI), Centre for Cognitive Science, Hessian.AI; TU Darmstadt, Germany; {name.surname}@tu-darmstadt.de |
| Pseudocode | Yes | Algorithm 1 LS-IQ |
| Open Source Code | Yes | The code is available at https://github.com/robfiras/ls-iq |
| Open Datasets | Yes | We evaluate our method on six MuJoCo environments: Ant-v3, Walker2d-v3, Hopper-v3, HalfCheetah-v3, Humanoid-v3, and Atlas. ... The code for the environment as well as the expert data is available at https://github.com/robfiras/ls-iq. |
| Dataset Splits | No | The paper states 'We use ten seeds and five expert trajectories for these experiments.' and mentions hyperparameter tuning ('we tune on each environment'), but it does not specify explicit train/validation splits (e.g., percentages or counts) for hyperparameter selection or model evaluation in the main text. |
| Hardware Specification | Yes | Calculations for this research were conducted on the Lichtenberg high-performance computer of the TU Darmstadt. |
| Software Dependencies | No | For a fair comparison, all methods are implemented in the same framework, MushroomRL (D'Eramo et al., 2021). We verify that our implementations achieve comparable results to the original implementations by the authors. We use the hyperparameters proposed by the original authors for the respective environments and perform a grid search on novel environments. |
| Experiment Setup | Yes | We use the hyperparameters proposed by the original authors for the respective environments and perform a grid search on novel environments. ... For our method, we use the same hyperparameters as IQ-Learn, except for the regularizer coefficient c and the entropy coefficient β, which we tune on each environment. We only consider equal mixing, i.e., α = 0.5. ... We use ten seeds and five expert trajectories for these experiments. *(A hedged sketch of this configuration follows the table.)* |
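
The Pseudocode and Experiment Setup rows above name the quantities that distinguish LS-IQ's configuration from IQ-Learn's: the regularizer coefficient c, the entropy coefficient β, and the equal mixing weight α = 0.5. The PyTorch sketch below illustrates how these could enter an IQ-Learn-style critic loss with a squared (χ²) regularizer applied to an equal mixture of expert and policy samples. It is a minimal sketch assembled from the quoted hyperparameters, not the authors' implementation (which uses MushroomRL); the network architecture, constant values, and the stub policy are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): an IQ-Learn-style critic loss with a
# squared (chi^2) regularizer on a 50/50 expert/policy mixture, i.e. the
# alpha = 0.5 "equal mixing" and the coefficients c and beta quoted above.
import torch
import torch.nn as nn

GAMMA, C_REG, BETA, ALPHA = 0.99, 0.5, 0.01, 0.5  # illustrative values, not the tuned ones

class QNet(nn.Module):
    """Small MLP critic Q(s, a); the architecture is an assumption."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def soft_value(q_net, policy, s):
    """Soft value V(s) = Q(s, a) - beta * log pi(a|s), with a ~ pi(.|s)."""
    a, log_pi = policy(s)
    return q_net(s, a) - BETA * log_pi

def mixture_regularized_loss(q_net, policy, expert, onpolicy, s0):
    """Each batch is (states, actions, next_states). The implicit reward is
    r(s, a) = Q(s, a) - gamma * V(s'); the regularizer c * E_mix[r^2] is taken
    over the alpha-weighted expert/policy mixture."""
    r_e = q_net(expert[0], expert[1]) - GAMMA * soft_value(q_net, policy, expert[2])
    r_p = q_net(onpolicy[0], onpolicy[1]) - GAMMA * soft_value(q_net, policy, onpolicy[2])
    reg = C_REG * (ALPHA * (r_e ** 2).mean() + (1.0 - ALPHA) * (r_p ** 2).mean())
    # Maximize E_expert[r] - (1 - gamma) * E[V(s0)] - reg, so minimize its negation.
    return -(r_e.mean() - (1.0 - GAMMA) * soft_value(q_net, policy, s0).mean() - reg)

# Usage with random stand-in data and a stub Gaussian policy (both hypothetical).
obs_dim, act_dim, n = 17, 6, 32
q = QNet(obs_dim, act_dim)

def policy(s):
    dist = torch.distributions.Normal(torch.zeros(s.shape[0], act_dim), 1.0)
    a = dist.sample()
    return a, dist.log_prob(a).sum(-1)

batch = lambda: (torch.randn(n, obs_dim), torch.randn(n, act_dim), torch.randn(n, obs_dim))
loss = mixture_regularized_loss(q, policy, batch(), batch(), torch.randn(n, obs_dim))
loss.backward()  # gradients flow into the critic only in this sketch
```

In a full training loop the critic would be updated jointly with a SAC-style actor; the stub policy here only supplies actions and log-probabilities so that the loss is computable end to end.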