Inverse Reinforcement Learning without Reinforcement Learning

Authors: Gokul Swamy, David Wu, Sanjiban Choudhury, Drew Bagnell, Steven Wu

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In practice, we find that we are able to significantly speed up the prior art on continuous control tasks. We conduct experiments with the PyBullet Suite (Coumans and Bai, 2016). We train experts using RL and then present all learners with 25 expert demonstrations to remove small-data concerns.
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Cornell University, 3 Aurora Innovation. Correspondence to: Gokul Swamy <gswamy@cmu.edu>.
Pseudocode | Yes | Algorithm 1: IRL (Dual Version, Ziebart et al., 2008a); Algorithm 2: IRL (Primal Version, Ho and Ermon, 2016); Algorithm 3: MMDP (Moment Matching by Dynamic Programming); Algorithm 4: NRMM (BR) (No-Regret Moment Matching, Best-Response Variant). A generic sketch of the dual IRL outer loop appears after this table.
Open Source Code | Yes | We release the code we used for all of our experiments at https://github.com/gkswamy98/fast_irl.
Open Datasets | Yes | We conduct experiments with the PyBullet Suite (Coumans and Bai, 2016). We also conduct experiments on the antmaze-large tasks from Fu et al. (2020). A dataset-loading sketch follows the table.
Dataset Splits | No | The paper mentions 'validation error' in the general algorithms (Algorithms 1 and 2) but does not provide specific details on training, validation, or test dataset splits for the experiments conducted with PyBullet or D4RL.
Hardware Specification | No | The paper mentions a 'GPU award from NVIDIA' but does not provide specific details about the GPU model, CPU, memory, or other hardware used for the experiments.
Software Dependencies | No | The paper mentions using the Soft Actor-Critic (Haarnoja et al., 2018) implementation provided by Raffin et al. (2019) and TD3+BC (Fujimoto and Gu, 2021), but it does not specify concrete software versions (e.g., PyTorch 1.9, Stable Baselines3 vX.Y) for replication.
Experiment Setup | Yes | Table 4 provides 'Expert and learner hyperparameters for SAC', including buffer size (300,000), batch size (256), γ (0.98), τ (0.02), training frequency (64), gradient steps (64), linearly scheduled learning rate (7.3e-4), policy architecture (256 × 2), state-dependent exploration (true), and training timesteps (1e6). Additionally, the text specifies 'Each outer loop iteration lasts for 5000 steps of environment interaction. We sample 4 trajectories to use in the discriminator update at the end of each outer-loop iteration.' and details such as 'We use α = 0.5 for both variants of FILTER' and the discriminator learning rates. A configuration sketch follows the table.
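
For the pseudocode row, the dual (reward-player) formulation of IRL cited in Algorithm 1 follows the well-known alternation between policy optimization and reward updates. The sketch below is a generic rendering of that outer loop with assumed helper stubs (train_policy_via_rl, rollout, update_reward, init_reward); it is not the paper's Algorithm 1 verbatim, and it omits the MMDP and NRMM accelerations that are the paper's contribution.

```python
# Generic sketch of the dual IRL outer loop (reward player vs. policy player).
# The helper callables are illustrative assumptions, not the paper's code.
from typing import Callable, List, Sequence, Tuple

Trajectory = List[Tuple]  # a sequence of (state, action) pairs


def irl_dual_loop(
    expert_trajs: Sequence[Trajectory],
    train_policy_via_rl: Callable,  # assumed: reward_fn -> policy
    rollout: Callable,              # assumed: policy -> list of trajectories
    update_reward: Callable,        # assumed: (reward_fn, expert, learner) -> reward_fn
    init_reward: Callable,          # assumed: () -> initial reward_fn
    num_iters: int = 100,
):
    """Alternate between (1) best-responding to the current reward via RL and
    (2) updating the reward to separate expert and learner moments."""
    reward_fn = init_reward()
    policy = None
    for _ in range(num_iters):
        # Policy player: (approximately) optimize the current reward.
        policy = train_policy_via_rl(reward_fn)
        # Reward player: push reward toward state-actions the expert visits
        # and away from those the current learner visits.
        learner_trajs = rollout(policy)
        reward_fn = update_reward(reward_fn, expert_trajs, learner_trajs)
    return policy
```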
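
For the datasets row, the following sketch shows one standard way to instantiate a PyBullet control task and pull an antmaze-large dataset from D4RL (Fu et al., 2020). The specific environment ids are assumed examples for illustration; the excerpts quoted above do not list them.

```python
# Sketch of loading the reported benchmarks; environment ids are assumed examples.
import gym
import pybullet_envs  # noqa: F401  (registers the PyBullet control tasks)
import d4rl           # noqa: F401  (registers the D4RL environments/datasets)

# A PyBullet continuous-control task (Coumans and Bai, 2016).
bullet_env = gym.make("HopperBulletEnv-v0")

# An antmaze-large task from D4RL (Fu et al., 2020); the offline dataset is
# returned as a dict of numpy arrays keyed by observations, actions, etc.
maze_env = gym.make("antmaze-large-diverse-v2")
dataset = maze_env.get_dataset()
print(dataset["observations"].shape, dataset["actions"].shape)
```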
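
For the experiment-setup row, the hyperparameters quoted from Table 4 can be assembled into a Stable-Baselines3 SAC configuration (the implementation referenced via Raffin et al., 2019). This is a minimal sketch: the environment id and the exact form of the linear learning-rate schedule are assumptions, not taken from the paper.

```python
# Minimal sketch: SAC expert training with the hyperparameters quoted from Table 4.
# The environment id and the linear-schedule helper are illustrative assumptions.
import gym
import pybullet_envs  # noqa: F401  (registers the PyBullet environments)
from stable_baselines3 import SAC


def linear_schedule(initial_value: float):
    """Linearly anneal the learning rate from initial_value down to 0."""
    def schedule(progress_remaining: float) -> float:
        return progress_remaining * initial_value
    return schedule


env = gym.make("HalfCheetahBulletEnv-v0")  # assumed example task

model = SAC(
    "MlpPolicy",
    env,
    buffer_size=300_000,                     # buffer size
    batch_size=256,                          # batch size
    gamma=0.98,                              # γ
    tau=0.02,                                # τ
    train_freq=64,                           # training frequency
    gradient_steps=64,                       # gradient steps
    learning_rate=linear_schedule(7.3e-4),   # learning rate, linear schedule
    policy_kwargs=dict(net_arch=[256, 256]), # policy architecture: 256 x 2
    use_sde=True,                            # state-dependent exploration
    verbose=1,
)
model.learn(total_timesteps=1_000_000)       # training timesteps
```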