Robust Imitation via Mirror Descent Inverse Reinforcement Learning
Authors: Dong-Sig Han, Hyunseo Kim, Hyundo Lee, JeHwan Ryu, Byoung-Tak Zhang
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks. |
| Researcher Affiliation | Academia | Artificial Intelligence Institute, Seoul National University {dshan, hskim, hdlee, jhryu, btzhang}@bi.snu.ac.kr |
| Pseudocode | Yes | Algorithm 1 Mirror Descent Adversarial Inverse Reinforcement Learning. |
| Open Source Code | No | Our empirical studies can be reproduced from the detailed information in Appendices B and C. (This statement refers to 'detailed information' for reproduction, not explicitly to open-source code provided via a link or in the supplementary materials. Without checking the appendices, the main text gives no concrete access to the source code.) |
| Open Datasets | Yes | MuJoCo [19] benchmarks, and The MuJoCo simulator used in our experiments is freely available to everyone. See the site (https://mujoco.org). |
| Dataset Splits | No | The paper discusses training with expert demonstrations and different numbers of episodes but does not provide explicit details on train/validation/test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | Yes | In experiments, each algorithm was executed on a CPU (a single thread). |
| Software Dependencies | No | The paper mentions software like RAC, SAC, and TensorFlow, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Input: trajectories {τ_t}_{t=1}^T, an agent π_θ, a reference policy π_ν, a neural network d_ξ: S → ℝ, a regularized reward function ψ_φ ∈ Ψ_Ω(Π), α_1, α_T, and λ. Fig. 5 shows that the Bregman divergence was large for MD-AIRL at the early training phase, because we chose the initial step size η_1 to be greater than 1 (α_1 = 0.5). MD-AIRL outperformed RAIRL in four cases by choosing an effectively low final step size η_T, less than 1 (α_T = 2). (An illustrative training-loop sketch based on these quoted inputs follows the table.) |
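
The Pseudocode and Experiment Setup rows quote Algorithm 1 (Mirror Descent Adversarial Inverse Reinforcement Learning) and its inputs. Below is a minimal, hedged sketch of a mirror-descent-style reference-policy update with a step-size schedule governed by α_1 and α_T, on a discrete toy problem. The function names, the α-to-η mapping (η_t = 1/α_t), and the logit-space interpolation are illustrative assumptions inferred from the quoted cells, not the authors' implementation, which additionally involves the adversarial networks d_ξ and ψ_φ and MuJoCo environments.

```python
# Hedged sketch of a mirror-descent reference-policy update (toy, discrete case).
# Names, the alpha->eta mapping, and the update rule are illustrative assumptions.
import numpy as np

def step_size_schedule(t, T, alpha_1, alpha_T):
    """Hypothetical schedule: interpolate alpha linearly and set eta_t = 1/alpha_t.

    Only the endpoints are grounded in the report (alpha_1 = 0.5 -> eta_1 > 1,
    alpha_T = 2 -> eta_T < 1); the exact schedule is defined in the paper.
    """
    alpha_t = alpha_1 + (alpha_T - alpha_1) * (t / max(T - 1, 1))
    return 1.0 / alpha_t

def mirror_descent_update(ref_logits, target_logits, eta):
    """Step under a negative-entropy mirror map: interpolate in the dual (logit)
    space, i.e. a geometric mixture of the two policies; eta > 1 extrapolates
    past the target."""
    return (1.0 - eta) * ref_logits + eta * target_logits

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_states, n_actions, T = 4, 3, 10
    alpha_1, alpha_T = 0.5, 2.0                      # values quoted in the table above
    ref_logits = np.zeros((n_states, n_actions))     # uniform reference policy pi_nu
    # Stand-in for the targets that the adversarial reward (d_xi, psi_phi) would
    # produce in the full algorithm.
    target_logits = rng.normal(size=(n_states, n_actions))

    for t in range(T):
        eta_t = step_size_schedule(t, T, alpha_1, alpha_T)
        ref_logits = mirror_descent_update(ref_logits, target_logits, eta_t)

    print(softmax(ref_logits).round(3))
```

Under these assumptions η_t decreases from 2.0 to 0.5 over training, consistent with the quoted behaviour that η_1 is greater than 1 (α_1 = 0.5) and η_T is less than 1 (α_T = 2).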