Rethinking Inverse Reinforcement Learning: from Data Alignment to Task Alignment

Authors: Weichao Zhou, Wenchao Li

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results show that our framework outperforms conventional IL baselines in complex and transfer learning scenarios. The complete code is available at https://github.com/zwc662/PAGAR. ... Experimental results show that our algorithm outperforms baselines on complex IL tasks with limited demonstrations and in challenging transfer environments.
Researcher Affiliation | Academia | Weichao Zhou, Boston University, Boston, MA 02215, zwc662@bu.edu; Wenchao Li, Boston University, Boston, MA 02215, wenchao@bu.edu
Pseudocode | Yes | Algorithm 1 (A Meta-Algorithm for Imitation Learning with PAGAR).
  Input: expert demonstration set E, IRL objective function J_IRL, loss bound δ, parameter λ ≥ 0, initial protagonist policy π_P, antagonist policy π_A, reward function r, maximum iteration number N. Output: π_P.
  1: for iteration i = 0, 1, ..., N do
  2:   Sample trajectory sets D_A ∼ π_A and D_P ∼ π_P
  3:   Optimize π_A: estimate J_RL(π_A; r) with D_A; update π_A to maximize J_RL(π_A; r)
  4:   Optimize π_P: estimate J_RL(π_P; r) with D_P; estimate J_{π_A}(π_P; r) with D_A; update π_P to maximize J_RL(π_P; r) + J_{π_A}(π_P; r)
  5:   Optimize r: estimate J_PAGAR(r; π_P, π_A) with D_P and D_A; estimate J_IRL(r) with D_A and E; update r to minimize J_PAGAR(r; π_P, π_A) + λ·(δ − J_IRL(r)); then update λ to maximize δ − J_IRL(r)
  6: end for
  7: return π_P
  (A Python sketch of this loop follows the table.)
Open Source Code | Yes | Our experimental results show that our framework outperforms conventional IL baselines in complex and transfer learning scenarios. The complete code is available at https://github.com/zwc662/PAGAR.
Open Datasets | Yes | Our benchmarks include two discrete-domain tasks from the MiniGrid environments (Chevalier-Boisvert et al. [2023]): DoorKey-6x6-v0 and SimpleCrossingS9N1-v0. ... We use four continuous control environments from MuJoCo. In the online RL setting, both protagonist and antagonist policies are permitted to explore the environment; in the offline RL setting, exploration by these policies is restricted. Specifically, for offline RL we use the D4RL expert datasets as the expert demonstrations and the random datasets as the offline suboptimal dataset. (A loading sketch for these environments and datasets follows the table.)
Dataset Splits | No | The paper does not explicitly mention train/validation/test splits by percentages or counts. It refers to an 'offline suboptimal dataset' for D4RL but does not specify how that data is partitioned for training, validation, or testing beyond its implied use for offline RL training.
Hardware Specification | Yes | All experiments are carried out on a quad-core i7-7700K processor running at 3.6 GHz with an NVIDIA GeForce GTX 1050 Ti GPU and 16 GB of memory.
Software Dependencies | No | The paper mentions that PPO (Schulman et al. [2017]) is used for policy training and refers to GAN-based components, but it does not specify version numbers for these software dependencies or for Python itself. (A minimal PPO loss sketch follows the table.)
Experiment Setup | Yes | The hyperparameters that appear in Algorithm 3 are summarized in Table 2, where we use N/A to indicate using δ, in which case we let µ = 0. Otherwise, the values of µ and δ vary depending on the task and IRL algorithm. The parameter λ0 is the initial value of λ, as explained in Appendix B.4. ... Table 2: Hyperparameters used in the training processes. (A hypothetical configuration sketch follows the table.)
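
The Pseudocode row above describes the PAGAR meta-algorithm as alternating antagonist, protagonist, and reward updates. The Python sketch below restates that loop under stated assumptions: the helper callables (sample_trajectories, rl_objective, proxy_objective, pagar_objective, irl_objective, update_policy, update_reward) are placeholders for the estimators named in the pseudocode, not functions from the released PAGAR code, and the dual-ascent update of λ is one plausible reading of "update λ to maximize δ − J_IRL(r)".

```python
# Minimal sketch of Algorithm 1 (the PAGAR meta-algorithm). All helper callables
# are assumptions standing in for the estimators described in the pseudocode row;
# they are not identifiers from https://github.com/zwc662/PAGAR.

def pagar_meta_algorithm(
    expert_demos,
    irl_objective,        # J_IRL(r), estimated from antagonist data and expert demos
    rl_objective,         # J_RL(pi; r), estimated on a trajectory set
    proxy_objective,      # J_{pi_A}(pi_P; r), protagonist objective under antagonist samples
    pagar_objective,      # J_PAGAR(r; pi_P, pi_A)
    sample_trajectories,  # rolls out a policy and returns a trajectory set
    update_policy,        # gradient step maximizing a policy objective (e.g. a PPO step)
    update_reward,        # gradient step minimizing a reward objective
    pi_P, pi_A, r,
    delta, lam=1.0, lam_lr=1e-2, n_iters=100,
):
    """Protagonist/antagonist imitation learning with PAGAR (sketch only)."""
    for _ in range(n_iters):
        # Step 2: sample trajectory sets D_A ~ pi_A and D_P ~ pi_P.
        D_A = sample_trajectories(pi_A)
        D_P = sample_trajectories(pi_P)

        # Step 3: antagonist update, maximize J_RL(pi_A; r) on D_A.
        pi_A = update_policy(pi_A, lambda pi: rl_objective(pi, r, D_A))

        # Step 4: protagonist update, maximize J_RL(pi_P; r) + J_{pi_A}(pi_P; r).
        pi_P = update_policy(
            pi_P,
            lambda pi: rl_objective(pi, r, D_P) + proxy_objective(pi, r, D_A),
        )

        # Step 5: reward update, minimize J_PAGAR(r) + lam * (delta - J_IRL(r)),
        # then adjust lam by dual ascent on the residual (an assumed update rule).
        def reward_loss(reward):
            return (pagar_objective(reward, pi_P, pi_A, D_P, D_A)
                    + lam * (delta - irl_objective(reward, D_A, expert_demos)))

        r = update_reward(r, reward_loss)
        lam = max(0.0, lam + lam_lr * (delta - irl_objective(r, D_A, expert_demos)))

    return pi_P
```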
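The Open Datasets row names two MiniGrid tasks and D4RL expert/random datasets on MuJoCo tasks. The snippet below is a minimal loading sketch; the specific D4RL task and version strings (e.g., halfcheetah-expert-v2) are illustrative assumptions, since the excerpt does not list the four MuJoCo environments used.

```python
# Sketch of loading the benchmark environments and datasets named in the table.
# The D4RL task/version strings below are illustrative placeholders.

import gymnasium
import minigrid  # noqa: F401  (importing registers the MiniGrid-* environment IDs)

# Discrete MiniGrid tasks (Chevalier-Boisvert et al., 2023).
doorkey = gymnasium.make("MiniGrid-DoorKey-6x6-v0")
crossing = gymnasium.make("MiniGrid-SimpleCrossingS9N1-v0")

# Continuous-control MuJoCo tasks via D4RL (which uses the older `gym` API).
import gym
import d4rl  # noqa: F401  (importing registers the D4RL dataset environments)

expert_env = gym.make("halfcheetah-expert-v2")   # expert demonstrations (assumed task)
random_env = gym.make("halfcheetah-random-v2")   # offline suboptimal dataset (assumed task)
expert_data = expert_env.get_dataset()           # dict of observations/actions/rewards/terminals
random_data = random_env.get_dataset()
```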
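The Software Dependencies row notes that PPO (Schulman et al. [2017]) is used for policy training. As a reference point, here is a minimal PyTorch sketch of the PPO clipped surrogate loss; the function and tensor names are placeholders, not identifiers from the paper's code.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate policy loss; inputs are 1-D tensors over sampled steps."""
    ratio = torch.exp(new_log_probs - old_log_probs)                      # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximizing the clipped surrogate objective = minimizing its negative mean.
    return -torch.min(unclipped, clipped).mean()
```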
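The Experiment Setup row mentions per-task hyperparameters λ0, δ, and µ from Table 2, with µ set to 0 whenever δ is marked N/A. The dataclass below is a purely hypothetical way to encode that convention; every value shown is a placeholder, not a setting reported in the paper.

```python
# Hypothetical organization of the Table 2 hyperparameters; values are placeholders.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PagarHyperparams:
    lambda0: float          # initial value of the Lagrangian parameter lambda
    delta: Optional[float]  # IRL loss bound; None encodes the "N/A" entries of Table 2
    mu: float = 0.0         # task-dependent coefficient (role not detailed in this excerpt)

    def __post_init__(self):
        if self.delta is None:
            self.mu = 0.0   # the convention stated in the row above: N/A delta => mu = 0

# Illustrative placeholder only, not a reported setting:
example = PagarHyperparams(lambda0=1.0, delta=None)
```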