Regularized Inverse Reinforcement Learning
Authors: Wonseok Jeon, Chen-Yang Su, Paul Barde, Thang Doan, Derek Nowrouzezahrai, Joelle Pineau
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose RAIRL, a practical sample-based IRL algorithm in regularized MDPs, and evaluate its applicability on policy imitation (for discrete and continuous controls) and reward acquisition (for discrete control). |
| Researcher Affiliation | Collaboration | ¹Mila, Quebec AI Institute; ²School of Computer Science, McGill University; ³Facebook AI Research |
| Pseudocode | Yes | Algorithm 1: Regularized Adversarial Inverse Reinforcement Learning (RAIRL) |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We validate RAIRL on MuJoCo continuous control tasks (Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2) |
| Dataset Splits | Yes | During RAIRL's training (Figure 3, top row), we use 1000 demonstrations sampled from the expert and periodically measure the mean Bregman divergence $\frac{1}{N}\sum_{i=1}^{N} D^A_\Omega(\pi(\cdot\mid s_i)\,\Vert\,\pi_E(\cdot\mid s_i))$, where $D^A_\Omega(p_1\Vert p_2) = \mathbb{E}_{a\sim p_1}[f_\phi(p_2(a)) - \phi(p_1(a))] - \mathbb{E}_{a\sim p_2}[f_\phi(p_2(a)) - \phi(p_2(a))]$. (A numerical sketch of this divergence follows the table.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions 'SAC implementation from rlpyt (Stooke & Abbeel, 2019)' and 'MuJoCo environments' but does not specify version numbers for these or any other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | J.5 HYPERPARAMETERS: Tables 2, 3, and 4 list the parameters used in our Bandit, Bermuda World, and MuJoCo experiments, respectively. Bandit hyper-parameters: Batch size 500; Initial exploration steps 10,000; Replay size 500,000; Target update rate (τ) 0.0005; Learning rate 0.0005; λ 5; q (Tsallis entropy T^k_q) 2.0; k (Tsallis entropy T^k_q) 1.0; Number of trajectories 1,000; Reward learning rate 0.0005; Steps per update 50; Total environment steps 500,000 |
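
As a complement to the Bregman-divergence metric quoted in the Dataset Splits row, below is a minimal NumPy sketch (not code from the paper) of that divergence for discrete action distributions. The function names, the choice $f_\phi(x) = \phi(x) + x\,\phi'(x)$, and the Shannon-entropy instantiation $\phi(x) = -\log x$ are illustrative assumptions; the paper's RAIRL implementation is not released, so this only mirrors the formula as quoted.

```python
# Minimal sketch (not from the paper) of the quoted evaluation metric:
# the Bregman divergence D^A_Omega(p1 || p2) induced by a regularizer
# Omega(p) = E_{a~p}[phi(p(a))], assuming f_phi(x) = phi(x) + x * phi'(x).
# Works on discrete action distributions given as probability vectors.
import numpy as np

def bregman_divergence(p1, p2, phi, f_phi):
    """D^A_Omega(p1||p2) = E_{a~p1}[f_phi(p2(a)) - phi(p1(a))]
                         - E_{a~p2}[f_phi(p2(a)) - phi(p2(a))]."""
    p1, p2 = np.asarray(p1, dtype=float), np.asarray(p2, dtype=float)
    term1 = np.sum(p1 * (f_phi(p2) - phi(p1)))  # expectation under p1
    term2 = np.sum(p2 * (f_phi(p2) - phi(p2)))  # expectation under p2
    return term1 - term2

# Shannon-entropy regularizer: phi(x) = -log(x), hence f_phi(x) = -log(x) - 1.
phi_shannon = lambda x: -np.log(x)
f_phi_shannon = lambda x: -np.log(x) - 1.0

# Example action distributions for one state (hypothetical values).
p = np.array([0.7, 0.2, 0.1])   # learned policy pi(.|s)
q = np.array([0.3, 0.4, 0.3])   # expert policy pi_E(.|s)
d_breg = bregman_divergence(p, q, phi_shannon, f_phi_shannon)
d_kl = np.sum(p * np.log(p / q))
print(d_breg, d_kl)  # for the Shannon case the divergence equals KL(p||q)
```

For the Shannon-entropy choice the divergence reduces to the KL divergence KL(p‖q), which serves as a quick sanity check on the sign convention in the quoted formula; averaging the per-state value over expert-visited states $s_i$ gives the reported mean metric.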