Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Regularized Inverse Reinforcement Learning
Authors: Wonseok Jeon, Chen-Yang Su, Paul Barde, Thang Doan, Derek Nowrouzezahrai, Joelle Pineau
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose RAIRL, a practical sample-based IRL algorithm in regularized MDPs, and evaluate its applicability on policy imitation (for discrete and continuous controls) and reward acquisition (for discrete control). |
| Researcher Affiliation | Collaboration | (1) Mila, Quebec AI Institute; (2) School of Computer Science, McGill University; (3) Facebook AI Research |
| Pseudocode | Yes | Algorithm 1: Regularized Adversarial Inverse Reinforcement Learning (RAIRL) |
| Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the described methodology or a direct link to a code repository. |
| Open Datasets | Yes | We validate RAIRL on MuJoCo continuous control tasks (Hopper-v2, Walker-v2, HalfCheetah-v2, Ant-v2) |
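| Dataset Splits | Yes | During RAIRL's training (Figure 3, top row), we use 1000 demonstrations sampled from the expert and periodically measure mean Bregman divergence, i.e., for $D_\Omega^A(p_1 \,\Vert\, p_2) = \mathbb{E}_{a \sim p_1}[f_\phi(p_2(a)) - \phi(p_1(a))] - \mathbb{E}_{a \sim p_2}[f_\phi(p_2(a)) - \phi(p_2(a))]$, $\frac{1}{N}\sum_{i=1}^{N} D_\Omega^A(\pi(\cdot \mid s_i) \,\Vert\, \pi_E(\cdot \mid s_i))$. |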
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper mentions 'SAC implementation from Rlpyt (Stooke & Abbeel, 2019)' and 'MuJoCo environments' but does not specify version numbers for these or any other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | J.5 HYPERPARAMETERS: Table 2, Table 3, and Table 4 list the parameters used in our Bandit, Bermuda World, and MuJoCo experiments, respectively. Hyperparameters (Bandit): batch size 500; initial exploration steps 10,000; replay size 500,000; target update rate (τ) 0.0005; learning rate 0.0005; λ 5; q (Tsallis entropy $T_q^k$) 2.0; k (Tsallis entropy $T_q^k$) 1.0; number of trajectories 1,000; reward learning rate 0.0005; steps per update 50; total environment steps 500,000. |
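
The mean Bregman divergence quoted in the Dataset Splits row can be illustrated for discrete action distributions. The sketch below is not the paper's code: the function names are hypothetical, the probabilities are assumed to be strictly positive, and the two choices of φ are assumptions made for illustration (φ(x) = -log x for Shannon entropy, and φ(x) = 1 - x for a Tsallis entropy with q = 2, k = 1). In both cases f_φ(x) is taken to be the derivative of x·φ(x), and any scaling by λ is omitted, as in the quoted formula.

```python
import math
from typing import Callable, Sequence

Probs = Sequence[float]
Fn = Callable[[float], float]


def bregman_divergence(p1: Probs, p2: Probs, phi: Fn, f_phi: Fn) -> float:
    """E_{a~p1}[f_phi(p2(a)) - phi(p1(a))] - E_{a~p2}[f_phi(p2(a)) - phi(p2(a))]."""
    term_p1 = sum(a * (f_phi(b) - phi(a)) for a, b in zip(p1, p2))
    term_p2 = sum(b * (f_phi(b) - phi(b)) for b in p2)
    return term_p1 - term_p2


def mean_bregman_divergence(policy: Sequence[Probs], expert: Sequence[Probs],
                            phi: Fn, f_phi: Fn) -> float:
    """Average the divergence over N states; row i holds the action probabilities at state s_i."""
    total = sum(bregman_divergence(pi, pe, phi, f_phi)
                for pi, pe in zip(policy, expert))
    return total / len(policy)


# Assumed Shannon case: phi(x) = -log x, so f_phi(x) = d/dx [x * phi(x)] = -log x - 1.
def shannon_phi(x: float) -> float:
    return -math.log(x)


def shannon_f(x: float) -> float:
    return -math.log(x) - 1.0


# Assumed Tsallis case (q = 2, k = 1): phi(x) = 1 - x, so f_phi(x) = 1 - 2x.
def tsallis_phi(x: float) -> float:
    return 1.0 - x


def tsallis_f(x: float) -> float:
    return 1.0 - 2.0 * x


if __name__ == "__main__":
    p1, p2 = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
    print(bregman_divergence(p1, p2, shannon_phi, shannon_f))
    print(bregman_divergence(p1, p2, tsallis_phi, tsallis_f))
```

Under these assumptions the Shannon case reduces to the KL divergence between the two distributions, and the q = 2, k = 1 Tsallis case reduces to their squared Euclidean distance, which gives a quick sanity check on the signs in the formula.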