Regularized Inverse Reinforcement Learning

Authors: Wonseok Jeon, Chen-Yang Su, Paul Barde, Thang Doan, Derek Nowrouzezahrai, Joelle Pineau

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose RAIRL, a practical sample-based IRL algorithm in regularized MDPs, and evaluate its applicability on policy imitation (for discrete and continuous control) and reward acquisition (for discrete control).
Researcher Affiliation | Collaboration | (1) Mila, Quebec AI Institute; (2) School of Computer Science, McGill University; (3) Facebook AI Research
Pseudocode | Yes | Algorithm 1: Regularized Adversarial Inverse Reinforcement Learning (RAIRL). (A schematic sketch of such an adversarial IRL loop follows the table.)
Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the described methodology, nor a direct link to a code repository.
Open Datasets | Yes | We validate RAIRL on MuJoCo continuous control tasks (Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2).
Dataset Splits | Yes | During RAIRL's training (Figure 3, top row), we use 1000 demonstrations sampled from the expert and periodically measure the mean Bregman divergence (1/N) Σ_{i=1}^{N} D^A_Ω(π(·|s_i) || π_E(·|s_i)), where D^A_Ω(p1||p2) = E_{a∼p1}[f_φ(p1(a)) − h_φ(p2(a))] − E_{a∼p2}[f_φ(p2(a)) − h_φ(p2(a))] and h_φ(x) = f_φ(x) + x f'_φ(x). (A numerical check of this divergence follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions the 'SAC implementation from rlpyt (Stooke & Abbeel, 2019)' and 'MuJoCo environments' but does not specify version numbers for these or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | J.5 HYPERPARAMETERS: Tables 2, 3, and 4 list the parameters used in our Bandit, Bermuda World, and MuJoCo experiments, respectively. Bandit hyperparameters (Table 2; a config sketch follows below):
Batch size: 500
Initial exploration steps: 10,000
Replay size: 500,000
Target update rate (τ): 0.0005
Learning rate: 0.0005
λ: 5
q (Tsallis entropy T^k_q): 2.0
k (Tsallis entropy T^k_q): 1.0
Number of trajectories: 1,000
Reward learning rate: 0.0005
Steps per update: 50
Total environment steps: 500,000
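
For convenience, the Bandit settings above can be collected into a single configuration object. The dictionary below is a hypothetical sketch, not the authors' code: the key names are illustrative, and only the values are taken from the reported Table 2.

# Hypothetical configuration mirroring the reported Bandit hyperparameters (Table 2).
# Key names are illustrative; only the values come from the paper.
BANDIT_HPARAMS = {
    "batch_size": 500,
    "initial_exploration_steps": 10_000,
    "replay_size": 500_000,
    "target_update_rate_tau": 0.0005,
    "learning_rate": 0.0005,
    "lambda": 5,
    "tsallis_q": 2.0,                  # q in the Tsallis entropy T^k_q
    "tsallis_k": 1.0,                  # k in the Tsallis entropy T^k_q
    "num_expert_trajectories": 1_000,
    "reward_learning_rate": 0.0005,
    "steps_per_update": 50,
    "total_environment_steps": 500_000,
}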
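
The Bregman divergence quoted in the Dataset Splits row can be checked numerically. The NumPy sketch below is not from the paper; the function names and example distributions are illustrative. It evaluates D^A_Ω for the regularizer Ω(p) = E_{a∼p}[f_φ(p(a))] over a finite action space; with the Shannon choice f_φ(x) = log x it reduces to the KL divergence, which the final assertion verifies.

import numpy as np

def bregman_divergence(p1, p2, f, f_prime):
    # D^A_Omega(p1 || p2) for Omega(p) = E_{a~p}[f(p(a))] on a finite action space,
    # using h(x) = f(x) + x * f'(x), the derivative of x * f(x).
    h = lambda x: f(x) + x * f_prime(x)
    term_p1 = np.sum(p1 * (f(p1) - h(p2)))  # E_{a~p1}[f(p1(a)) - h(p2(a))]
    term_p2 = np.sum(p2 * (f(p2) - h(p2)))  # E_{a~p2}[f(p2(a)) - h(p2(a))]
    return term_p1 - term_p2

# Shannon regularizer f(x) = log(x) recovers the KL divergence.
p1 = np.array([0.7, 0.2, 0.1])   # e.g., learner policy pi(.|s_i)
p2 = np.array([0.4, 0.4, 0.2])   # e.g., expert policy pi_E(.|s_i)
d = bregman_divergence(p1, p2, np.log, lambda x: 1.0 / x)
kl = np.sum(p1 * (np.log(p1) - np.log(p2)))
assert np.isclose(d, kl)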
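
Finally, the Pseudocode row points to Algorithm 1 (RAIRL), which is not reproduced here. The skeleton below is only a generic adversarial-IRL training loop of the kind RAIRL instantiates, written against placeholder callables: roll out the current policy, update a discriminator-based reward on expert versus policy samples, then take a regularized policy-improvement step. It is a hedged structural sketch, not the authors' implementation.

from typing import Callable, List, Tuple

StateAction = Tuple[object, object]  # (state, action) pair

def adversarial_irl_loop(
    collect_rollout: Callable[[], List[StateAction]],         # sample (s, a) from the current policy
    sample_expert_batch: Callable[[], List[StateAction]],     # sample (s, a) from expert demonstrations
    update_discriminator: Callable[[List[StateAction], List[StateAction]], None],
    discriminator_reward: Callable[[object, object], float],  # reward derived from the discriminator
    regularized_policy_update: Callable[[List[StateAction], Callable[[object, object], float]], None],
    num_iterations: int = 1000,
) -> None:
    # Schematic adversarial IRL loop: each iteration trains the discriminator to
    # separate expert from policy samples, then improves the policy with a
    # regularized RL step (e.g., Shannon- or Tsallis-entropy-regularized) under
    # the discriminator-derived reward. All callables are placeholders.
    for _ in range(num_iterations):
        policy_batch = collect_rollout()
        expert_batch = sample_expert_batch()
        update_discriminator(policy_batch, expert_batch)
        regularized_policy_update(policy_batch, discriminator_reward)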