Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
Authors: Justin Fu, Katie Luo, Sergey Levine
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we aim to answer two questions: 1. Can AIRL learn disentangled rewards that are robust to changes in environment dynamics? 2. Is AIRL efficient and scalable to high-dimensional continuous control tasks? To answer 1, we evaluate AIRL in transfer learning scenarios, where a reward is learned in a training environment, and optimized in a test environment with significantly different dynamics. ... Numerical results for these environment transfer experiments are given in Table 1. |
| Researcher Affiliation | Academia | Justin Fu, Katie Luo, Sergey Levine. Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA. justinjfu@eecs.berkeley.edu, katieluo@berkeley.edu, svlevine@eecs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Adversarial inverse reinforcement learning (an illustrative sketch of the discriminator update is given after this table) |
| Open Source Code | Yes | Our code and additional supplementary material including videos will be available at https://sites.google.com/view/adversarial-irl |
| Open Datasets | Yes | Finally, we evaluate AIRL as an imitation learning algorithm against the GAN-GCL and the state-of-the-art GAIL on several benchmark tasks. Each algorithm is presented with 50 expert demonstrations, collected from a policy trained with TRPO on the ground truth reward function. ... Table 2: Results on imitation learning benchmark tasks. Mean scores (higher is better) are reported across 5 runs (Pendulum / Ant / Swimmer / Half-Cheetah): GAN-GCL -261.5 / 460.6 / -10.6 / -670.7; GAIL -226.0 / 1358.7 / 140.2 / 1642.8; AIRL (ours) -204.7 / 1238.6 / 139.1 / 1839.8; AIRL State Only (ours) -221.5 / 1089.3 / 136.4 / 891.9; Expert (TRPO) -179.6 / 1537.9 / 141.1 / 1811.2; Random -654.5 / -108.1 / -11.5 / -988.4 |
| Dataset Splits | No | The paper describes collecting 'expert demonstrations' and using 'training environment' and 'test environment' for transfer, but it does not specify a dedicated validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions software components such as trust region policy optimization and soft value iteration, but does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | Entropy regularization: We use an entropy regularizer weight of 0.1 for Ant, Swimmer, and Half Cheetah across all methods. We use an entropy regularizer weight of 1.0 on the point mass environment. TRPO Batch Size: For Ant, Swimmer and Half Cheetah environments, we use a batch size of 10000 steps per TRPO update. For pendulum, we use a batch size of 2000. |
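The Pseudocode row above refers to the paper's Algorithm 1, which alternates a discriminator update with a policy (TRPO) update. The sketch below, written in PyTorch, illustrates the discriminator structure the paper describes: f(s, a, s') = g(s, a) + γh(s') − h(s), a logistic-regression objective that labels expert transitions 1 and policy transitions 0, and a learned reward log D − log(1 − D) = f − log π. The class names, network sizes, and the `(obs, act, next_obs, log_pi)` batch layout are assumptions for illustration; they are not the authors' released code, and the policy step (TRPO) is omitted.

```python
# Minimal sketch of the AIRL discriminator and its update, assuming batches of
# expert and policy transitions plus the policy's log-probabilities are provided.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, hidden=32):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, hidden), nn.Tanh(),
                         nn.Linear(hidden, 1))


class AIRLDiscriminator(nn.Module):
    """f(s, a, s') = g(s, a) + gamma * h(s') - h(s); D = sigmoid(f - log pi)."""

    def __init__(self, obs_dim, act_dim, gamma=0.99, state_only=False):
        super().__init__()
        self.gamma = gamma
        self.state_only = state_only
        g_in = obs_dim if state_only else obs_dim + act_dim
        self.g = mlp(g_in)       # reward approximator g
        self.h = mlp(obs_dim)    # shaping term h

    def f(self, obs, act, next_obs):
        g_in = obs if self.state_only else torch.cat([obs, act], dim=-1)
        return (self.g(g_in) + self.gamma * self.h(next_obs) - self.h(obs)).squeeze(-1)

    def logits(self, obs, act, next_obs, log_pi):
        # D = exp(f) / (exp(f) + pi)  is equivalent to  logit = f - log pi
        return self.f(obs, act, next_obs) - log_pi

    def reward(self, obs, act, next_obs, log_pi):
        # log D - log(1 - D) reduces to f - log pi (entropy-regularized reward)
        with torch.no_grad():
            return self.logits(obs, act, next_obs, log_pi)


def discriminator_step(disc, opt, expert_batch, policy_batch):
    """One logistic-regression update: expert transitions -> 1, policy -> 0."""
    logits_e = disc.logits(*expert_batch)   # each batch: (obs, act, next_obs, log_pi)
    logits_p = disc.logits(*policy_batch)
    loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e)) +
            F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the paper's scheme, the policy would then be updated with TRPO on `reward(...)`, and the two steps alternate until convergence; the state-only flag corresponds to the "AIRL State Only" rows in Table 2.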
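For quick reference, the hyperparameters quoted in the Experiment Setup row can be grouped into a single mapping. The layout below is purely illustrative; the paper does not prescribe any configuration format, and the environment keys simply follow the names used in the quoted text.

```python
# Illustrative grouping of the quoted hyperparameters (not the authors' config format).
AIRL_HYPERPARAMS = {
    "entropy_regularizer_weight": {
        "Ant": 0.1, "Swimmer": 0.1, "HalfCheetah": 0.1,
        "PointMass": 1.0,            # point mass environment
    },
    "trpo_batch_size": {             # environment steps per TRPO update
        "Ant": 10000, "Swimmer": 10000, "HalfCheetah": 10000,
        "Pendulum": 2000,
    },
}
```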