Learning Robust Rewards with Adversarial Inverse Reinforcement Learning

Authors: Justin Fu, Katie Luo, Sergey Levine

ICLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we aim to answer two questions: 1. Can AIRL learn disentangled rewards that are robust to changes in environment dynamics? 2. Is AIRL efficient and scalable to high-dimensional continuous control tasks? To answer 1, we evaluate AIRL in transfer learning scenarios, where a reward is learned in a training environment, and optimized in a test environment with significantly different dynamics. ... Numerical results for these environment transfer experiments are given in Table 1. (A hedged sketch of this transfer protocol follows the table.)
Researcher Affiliation | Academia | Justin Fu, Katie Luo, Sergey Levine; Department of Electrical Engineering and Computer Science, University of California, Berkeley; Berkeley, CA 94720, USA; justinjfu@eecs.berkeley.edu, katieluo@berkeley.edu, svlevine@eecs.berkeley.edu
Pseudocode | Yes | Algorithm 1: Adversarial inverse reinforcement learning (a hedged code sketch of this loop follows the table)
Open Source Code | Yes | Our code and additional supplementary material including videos will be available at https://sites.google.com/view/adversarial-irl
Open Datasets | Yes | Finally, we evaluate AIRL as an imitation learning algorithm against the GAN-GCL and the state-of-the-art GAIL on several benchmark tasks. Each algorithm is presented with 50 expert demonstrations, collected from a policy trained with TRPO on the ground truth reward function. ... Table 2: Results on imitation learning benchmark tasks. Mean scores (higher is better) are reported across 5 runs.
    Method                   Pendulum   Ant      Swimmer   Half-Cheetah
    GAN-GCL                  -261.5     460.6    -10.6     -670.7
    GAIL                     -226.0     1358.7   140.2     1642.8
    AIRL (ours)              -204.7     1238.6   139.1     1839.8
    AIRL State Only (ours)   -221.5     1089.3   136.4     891.9
    Expert (TRPO)            -179.6     1537.9   141.1     1811.2
    Random                   -654.5     -108.1   -11.5     -988.4
Dataset Splits | No | The paper describes collecting 'expert demonstrations' and using a 'training environment' and a 'test environment' for transfer, but it does not specify a dedicated validation dataset split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments.
Software Dependencies | No | The paper mentions software components such as 'trust region policy optimization' and 'soft value iteration' but does not specify library versions or other versioned software dependencies.
Experiment Setup | Yes | Entropy regularization: We use an entropy regularizer weight of 0.1 for Ant, Swimmer, and Half Cheetah across all methods. We use an entropy regularizer weight of 1.0 on the point mass environment. TRPO Batch Size: For Ant, Swimmer and Half Cheetah environments, we use a batch size of 10000 steps per TRPO update. For pendulum, we use a batch size of 2000. (These values are collected into a config sketch after the table.)
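The transfer-learning evaluation quoted in the Research Type row follows a three-step protocol: learn a reward in a training environment, re-optimize a fresh policy in a test environment with different dynamics, and score that policy with the ground-truth reward. The snippet below is a minimal sketch of that protocol only; the `learn_reward`, `optimize_policy`, and `ground_truth_return` callables are hypothetical stand-ins for the paper's actual AIRL/TRPO pipeline.

```python
from typing import Any, Callable

def evaluate_reward_transfer(
    learn_reward: Callable[[Any], Callable],          # run AIRL in the training env, return the learned reward fn
    optimize_policy: Callable[[Any, Callable], Any],  # e.g. TRPO against a fixed reward function
    ground_truth_return: Callable[[Any, Any], float], # average true-reward return of a policy
    train_env: Any,
    test_env: Any,
) -> float:
    """Learn a reward in train_env, re-optimize from scratch in test_env, score with the true reward."""
    reward_fn = learn_reward(train_env)            # reward is learned where the expert data was collected
    policy = optimize_policy(test_env, reward_fn)  # dynamics differ here; no further expert data or reward updates
    return ground_truth_return(test_env, policy)   # the quantity reported in Table 1 of the paper
```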
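As evidence for the Pseudocode row, the paper's Algorithm 1 alternates between fitting a discriminator by binary logistic regression on expert versus policy transitions and updating the policy against the reward log D − log(1 − D). Below is a minimal PyTorch sketch of the discriminator step, assuming the paper's state-only decomposition f(s, a, s') = g(s) + γ h(s') − h(s); the network sizes, discount value, optimizer handling, and helper names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor; illustrative value, not taken from the paper

class AIRLDiscriminator(nn.Module):
    """State-only AIRL discriminator: D(s,a,s') = exp(f) / (exp(f) + pi(a|s)),
    with f(s,a,s') = g(s) + gamma * h(s') - h(s)."""
    def __init__(self, obs_dim: int, hidden: int = 32):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.h = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def f(self, s: torch.Tensor, s_next: torch.Tensor) -> torch.Tensor:
        return self.g(s) + GAMMA * self.h(s_next) - self.h(s)

    def logits(self, s: torch.Tensor, s_next: torch.Tensor, log_pi: torch.Tensor) -> torch.Tensor:
        # logit(D) = f(s, s') - log pi(a|s); this is also the reward log D - log(1 - D)
        return self.f(s, s_next) - log_pi

def discriminator_step(disc, optimizer, expert_batch, policy_batch) -> float:
    """One logistic-regression update: expert transitions get label 1, policy samples label 0.
    Each batch is a tuple (s, s_next, log_pi) of shape-[N, ...] tensors, where log_pi is the
    current policy's log-probability of the batch actions (shape [N, 1])."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc.logits(*expert_batch)
    policy_logits = disc.logits(*policy_batch)
    loss = bce(expert_logits, torch.ones_like(expert_logits)) + \
           bce(policy_logits, torch.zeros_like(policy_logits))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the outer loop of Algorithm 1, the same logit quantity is fed back to the policy optimizer (TRPO in the paper) as the reward on freshly collected trajectories.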
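For convenience, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration mapping. The dictionary layout and key names below are illustrative; only the numeric values come from the paper.

```python
# Entropy regularizer weight per environment (applied across all methods in the paper).
ENTROPY_REG_WEIGHT = {
    "Ant": 0.1,
    "Swimmer": 0.1,
    "HalfCheetah": 0.1,
    "PointMass": 1.0,
}

# Environment steps per TRPO update.
TRPO_BATCH_SIZE = {
    "Ant": 10_000,
    "Swimmer": 10_000,
    "HalfCheetah": 10_000,
    "Pendulum": 2_000,
}
```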