Regularized Inverse Reinforcement Learning

Authors: Wonseok Jeon, Chen-Yang Su, Paul Barde, Thang Doan, Derek Nowrouzezahrai, Joelle Pineau

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose RAIRL, a practical sample-based IRL algorithm in regularized MDPs, and evaluate its applicability on policy imitation (for discrete and continuous control) and reward acquisition (for discrete control).
Researcher Affiliation | Collaboration | (1) Mila, Quebec AI Institute; (2) School of Computer Science, McGill University; (3) Facebook AI Research
Pseudocode | Yes | Algorithm 1: Regularized Adversarial Inverse Reinforcement Learning (RAIRL). (A schematic sketch of such an adversarial IRL loop follows the table.)
Open Source Code | No | The paper does not include an unambiguous statement about releasing source code for the described methodology, nor a direct link to a code repository.
Open Datasets | Yes | We validate RAIRL on MuJoCo continuous control tasks (Hopper-v2, Walker2d-v2, HalfCheetah-v2, Ant-v2).
Dataset Splits | Yes | During RAIRL's training (Figure 3, top row), we use 1000 demonstrations sampled from the expert and periodically measure the mean Bregman divergence (1/N) Σ_{i=1}^{N} D^A_Ω(π(·|s_i) || π_E(·|s_i)), where D^A_Ω(p1||p2) = E_{a∼p1}[f_φ(p1(a)) − h_φ(p2(a))] − E_{a∼p2}[f_φ(p2(a)) − h_φ(p2(a))] and h_φ(x) = f_φ(x) + x f'_φ(x). (A numerical check of this divergence follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper mentions the 'SAC implementation from rlpyt (Stooke & Abbeel, 2019)' and 'MuJoCo environments' but does not specify version numbers for these or any other software dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | J.5 HYPERPARAMETERS: Tables 2, 3, and 4 list the parameters used in our Bandit, Bermuda World, and MuJoCo experiments, respectively. Bandit hyperparameters (Table 2; a config sketch follows below):
Batch size: 500
Initial exploration steps: 10,000
Replay size: 500,000
Target update rate (τ): 0.0005
Learning rate: 0.0005
λ: 5
q (Tsallis entropy T^k_q): 2.0
k (Tsallis entropy T^k_q): 1.0
Number of trajectories: 1,000
Reward learning rate: 0.0005
Steps per update: 50
Total environment steps: 500,000
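
For convenience, the Bandit settings above can be collected into a single configuration object. The dictionary below is a hypothetical sketch, not the authors' code: the key names are illustrative, and only the values are taken from the reported Table 2.

# Hypothetical configuration mirroring the reported Bandit hyperparameters (Table 2).
# Key names are illustrative; only the values come from the paper.
BANDIT_HPARAMS = {
    "batch_size": 500,
    "initial_exploration_steps": 10_000,
    "replay_size": 500_000,
    "target_update_rate_tau": 0.0005,
    "learning_rate": 0.0005,
    "lambda": 5,
    "tsallis_q": 2.0,                  # q in the Tsallis entropy T^k_q
    "tsallis_k": 1.0,                  # k in the Tsallis entropy T^k_q
    "num_expert_trajectories": 1_000,
    "reward_learning_rate": 0.0005,
    "steps_per_update": 50,
    "total_environment_steps": 500_000,
}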
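
The Bregman divergence quoted in the Dataset Splits row can be checked numerically. The NumPy sketch below is not from the paper; the function names and example distributions are illustrative. It evaluates D^A_Ω for the regularizer Ω(p) = E_{a∼p}[f_φ(p(a))] over a finite action space; with the Shannon choice f_φ(x) = log x it reduces to the KL divergence, which the final assertion verifies.

import numpy as np

def bregman_divergence(p1, p2, f, f_prime):
    # D^A_Omega(p1 || p2) for Omega(p) = E_{a~p}[f(p(a))] on a finite action space,
    # using h(x) = f(x) + x * f'(x), the derivative of x * f(x).
    h = lambda x: f(x) + x * f_prime(x)
    term_p1 = np.sum(p1 * (f(p1) - h(p2)))  # E_{a~p1}[f(p1(a)) - h(p2(a))]
    term_p2 = np.sum(p2 * (f(p2) - h(p2)))  # E_{a~p2}[f(p2(a)) - h(p2(a))]
    return term_p1 - term_p2

# Shannon regularizer f(x) = log(x) recovers the KL divergence.
p1 = np.array([0.7, 0.2, 0.1])   # e.g., learner policy pi(.|s_i)
p2 = np.array([0.4, 0.4, 0.2])   # e.g., expert policy pi_E(.|s_i)
d = bregman_divergence(p1, p2, np.log, lambda x: 1.0 / x)
kl = np.sum(p1 * (np.log(p1) - np.log(p2)))
assert np.isclose(d, kl)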
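
Finally, the Pseudocode row points to Algorithm 1 (RAIRL), which is not reproduced here. The skeleton below is only a generic adversarial-IRL training loop of the kind RAIRL instantiates, written against placeholder callables: roll out the current policy, update a discriminator-based reward on expert versus policy samples, then take a regularized policy-improvement step. It is a hedged structural sketch, not the authors' implementation.

from typing import Callable, List, Tuple

StateAction = Tuple[object, object]  # (state, action) pair

def adversarial_irl_loop(
    collect_rollout: Callable[[], List[StateAction]],         # sample (s, a) from the current policy
    sample_expert_batch: Callable[[], List[StateAction]],     # sample (s, a) from expert demonstrations
    update_discriminator: Callable[[List[StateAction], List[StateAction]], None],
    discriminator_reward: Callable[[object, object], float],  # reward derived from the discriminator
    regularized_policy_update: Callable[[List[StateAction], Callable[[object, object], float]], None],
    num_iterations: int = 1000,
) -> None:
    # Schematic adversarial IRL loop: each iteration trains the discriminator to
    # separate expert from policy samples, then improves the policy with a
    # regularized RL step (e.g., Shannon- or Tsallis-entropy-regularized) under
    # the discriminator-derived reward. All callables are placeholders.
    for _ in range(num_iterations):
        policy_batch = collect_rollout()
        expert_batch = sample_expert_batch()
        update_discriminator(policy_batch, expert_batch)
        regularized_policy_update(policy_batch, discriminator_reward)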