Hybrid Inverse Reinforcement Learning
Authors: Juntao Ren, Gokul Swamy, Steven Wu, Drew Bagnell, Sanjiban Choudhury
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we find that our approaches are significantly more sample efficient than standard inverse RL and several other baselines on a suite of continuous control tasks. |
| Researcher Affiliation | Collaboration | ¹Cornell University, ²Carnegie Mellon University, ³Aurora Innovation. |
| Pseudocode | Yes | Algorithm 1 (Dual) IRL (Ziebart et al. (2008b)), Algorithm 2 Hybrid Policy Emulation (HyPE), Algorithm 3 Hybrid RL (HyRL), Algorithm 4 Hybrid Policy Emulation w/ Resets (HyPER) |
| Open Source Code | Yes | We release the code we used for all of our experiments at https://github.com/jren03/garage. |
| Open Datasets | Yes | On the MuJoCo locomotion benchmark environments (Brockman et al., 2016)... and: Our next set of experiments consider the D4RL (Fu et al., 2020) antmaze-large environments |
| Dataset Splits | No | The paper mentions 'validation data' in Algorithms 2 and 3 ('Return best of π1:T on validation data.' and 'Return Best of π1:N, Q1:N on validation data.'), implying its use, but it does not specify explicit split percentages, sample counts, or the methodology for constructing the validation set from the primary datasets (MuJoCo or D4RL). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The paper mentions several software components and libraries, such as Optimistic Adam, Soft Actor Critic (Haarnoja et al., 2018) implemented by Raffin et al. (2019), and TD3+BC (Fujimoto & Gu, 2021), but it does not provide specific version numbers for these software dependencies or the underlying frameworks like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We implement HyPE by updating the policy and critic networks in Soft Actor Critic (Haarnoja et al., 2018) with expert and learner samples. We implement HyPER by running model-based policy optimization (Janner et al., 2019) and resetting to expert states in the learned model. No reward information is provided in either case, so we also train a discriminator network. Appendix B includes additional implementation details and hyperparameters. |
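The Experiment Setup row describes mixing expert and learner samples in the actor-critic update and relabeling rewards with a learned discriminator. Below is a minimal sketch of that idea, not the authors' released code (https://github.com/jren03/garage): the network sizes, the 50/50 expert/learner mix, and the GAIL-style log D(s, a) reward are illustrative assumptions.

```python
# Minimal sketch of a HyPE-style mixed-batch update (illustrative, not the
# authors' implementation): each critic/policy batch concatenates expert and
# learner transitions, with rewards supplied by a learned discriminator.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert (1) vs. learner (0)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def discriminator_step(disc, opt, expert_batch, learner_batch):
    """One binary cross-entropy update: expert pairs labeled 1, learner pairs 0."""
    logits_e = disc(*expert_batch)
    logits_l = disc(*learner_batch)
    loss = nn.functional.binary_cross_entropy_with_logits(
        logits_e, torch.ones_like(logits_e)
    ) + nn.functional.binary_cross_entropy_with_logits(
        logits_l, torch.zeros_like(logits_l)
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


def hybrid_batch(disc, expert_batch, learner_batch):
    """Concatenate expert and learner transitions and relabel rewards with the
    discriminator (GAIL-style reward log D(s, a); an assumption of this sketch)."""
    obs = torch.cat([expert_batch[0], learner_batch[0]])
    act = torch.cat([expert_batch[1], learner_batch[1]])
    with torch.no_grad():
        # log sigmoid(logits) = -softplus(-logits); higher when the pair looks expert-like.
        reward = -nn.functional.softplus(-disc(obs, act))
    return obs, act, reward
```

In this sketch, the SAC critic and policy updates would consume the output of `hybrid_batch` exactly as they would an ordinary replay batch; HyPER's variant, which resets to expert states inside a learned dynamics model, additionally requires that model and is not shown.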