Hybrid Inverse Reinforcement Learning

Authors: Juntao Ren, Gokul Swamy, Steven Wu, Drew Bagnell, Sanjiban Choudhury

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we find that our approaches are significantly more sample efficient than standard inverse RL and several other baselines on a suite of continuous control tasks.
Researcher Affiliation | Collaboration | ¹Cornell University, ²Carnegie Mellon University, ³Aurora Innovation.
Pseudocode | Yes | Algorithm 1: (Dual) IRL (Ziebart et al., 2008b), Algorithm 2: Hybrid Policy Emulation (HyPE), Algorithm 3: Hybrid RL (HyRL), Algorithm 4: Hybrid Policy Emulation w/ Resets (HyPER)
Open Source Code | Yes | We release the code we used for all of our experiments at https://github.com/jren03/garage.
Open Datasets | Yes | On the MuJoCo locomotion benchmark environments (Brockman et al., 2016)... and Our next set of experiments consider the D4RL (Fu et al., 2020) antmaze-large environments
Dataset Splits | No | The paper mentions 'validation data' in Algorithms 2 and 3 ('Return best of π1:T on validation data.' and 'Return Best of π1:N, Q1:N on validation data.'), implying its use, but it does not specify the split percentages, sample counts, or methodology for constructing the validation set from the primary MuJoCo or D4RL datasets.
Hardware Specification | No | The paper does not provide hardware details such as GPU or CPU models, memory specifications, or cloud computing instance types used to run the experiments.
Software Dependencies | No | The paper mentions several software components, such as Optimistic Adam, Soft Actor-Critic (Haarnoja et al., 2018) as implemented by Raffin et al. (2019), and TD3+BC (Fujimoto & Gu, 2021), but it does not give version numbers for these dependencies or for underlying frameworks such as Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We implement HyPE by updating the policy and critic networks in Soft Actor-Critic (Haarnoja et al., 2018) with expert and learner samples. We implement HyPER by running model-based policy optimization (Janner et al., 2019) and resetting to expert states in the learned model. No reward information is provided in either case, so we also train a discriminator network. Appendix B includes additional implementation details and hyperparameters.
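
To make the Experiment Setup row concrete, below is a minimal, hypothetical sketch of the kind of update loop it describes: a SAC-style agent trained on a half-expert, half-learner batch whose rewards come from a learned discriminator rather than the environment. All names here (Discriminator, hype_style_update, the buffer and agent interfaces) are illustrative assumptions, not the API of the released garage code, and the reward parameterization is a common GAIL-style choice that may differ from the paper's.

import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Classifies (state, action) pairs as expert (label 1) vs. learner (label 0)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

    def reward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        # GAIL-style surrogate reward -log(1 - D(s, a)); illustrative only.
        return -nn.functional.logsigmoid(-self(obs, act))


def hype_style_update(disc, disc_opt, agent, expert_buffer, learner_buffer, batch_size=256):
    """One hybrid update: fit the discriminator, then take an off-policy
    actor-critic step on a half-expert / half-learner batch whose rewards
    are relabeled by the discriminator (no environment reward is used)."""
    exp = expert_buffer.sample(batch_size // 2)  # assumed: dict with obs, act, next_obs, done
    lrn = learner_buffer.sample(batch_size // 2)

    # 1) Discriminator step: expert transitions -> 1, learner transitions -> 0.
    logits_e = disc(exp["obs"], exp["act"])
    logits_l = disc(lrn["obs"], lrn["act"])
    bce = nn.functional.binary_cross_entropy_with_logits
    disc_loss = bce(logits_e, torch.ones_like(logits_e)) + bce(logits_l, torch.zeros_like(logits_l))
    disc_opt.zero_grad()
    disc_loss.backward()
    disc_opt.step()

    # 2) Policy/critic step on the mixed batch with discriminator rewards.
    batch = {k: torch.cat([exp[k], lrn[k]]) for k in exp}
    with torch.no_grad():
        batch["reward"] = disc.reward(batch["obs"], batch["act"])
    agent.update(batch)  # assumed SAC-like interface consuming an off-policy batch

The half-expert, half-learner batch is what makes the update "hybrid": a standard inverse-RL inner loop would train the critic on learner samples alone, whereas here expert transitions enter the off-policy update directly.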