Learning Robust Rewards with Adversarial Inverse Reinforcement Learning
Authors: Justin Fu, Katie Luo, Sergey Levine
ICLR 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we aim to answer two questions: 1. Can AIRL learn disentangled rewards that are robust to changes in environment dynamics? 2. Is AIRL efficient and scalable to high-dimensional continuous control tasks? To answer 1, we evaluate AIRL in transfer learning scenarios, where a reward is learned in a training environment, and optimized in a test environment with significantly different dynamics. ... Numerical results for these environment transfer experiments are given in Table 1. |
| Researcher Affiliation | Academia | Justin Fu, Katie Luo, Sergey Levine. Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720, USA. justinjfu@eecs.berkeley.edu, katieluo@berkeley.edu, svlevine@eecs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Adversarial inverse reinforcement learning (an illustrative sketch of the discriminator update is given after this table) |
| Open Source Code | Yes | Our code and additional supplementary material including videos will be available at https://sites.google.com/view/adversarial-irl |
| Open Datasets | Yes | Finally, we evaluate AIRL as an imitation learning algorithm against the GAN-GCL and the state-of-the-art GAIL on several benchmark tasks. Each algorithm is presented with 50 expert demonstrations, collected from a policy trained with TRPO on the ground truth reward function. ... Table 2: Results on imitation learning benchmark tasks. Mean scores (higher is better) are reported across 5 runs (Pendulum / Ant / Swimmer / Half-Cheetah): GAN-GCL -261.5 / 460.6 / -10.6 / -670.7; GAIL -226.0 / 1358.7 / 140.2 / 1642.8; AIRL (ours) -204.7 / 1238.6 / 139.1 / 1839.8; AIRL State Only (ours) -221.5 / 1089.3 / 136.4 / 891.9; Expert (TRPO) -179.6 / 1537.9 / 141.1 / 1811.2; Random -654.5 / -108.1 / -11.5 / -988.4 |
| Dataset Splits | No | The paper describes collecting 'expert demonstrations' and using 'training environment' and 'test environment' for transfer, but it does not specify a dedicated validation dataset split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions software components such as trust region policy optimization and soft value iteration, but does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | Entropy regularization: We use an entropy regularizer weight of 0.1 for Ant, Swimmer, and Half Cheetah across all methods. We use an entropy regularizer weight of 1.0 on the point mass environment. TRPO Batch Size: For Ant, Swimmer and Half Cheetah environments, we use a batch size of 10000 steps per TRPO update. For pendulum, we use a batch size of 2000. |
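The Pseudocode row above refers to the paper's Algorithm 1, which alternates a discriminator update with a policy (TRPO) update. The sketch below, written in PyTorch, illustrates the discriminator structure the paper describes: f(s, a, s') = g(s, a) + γh(s') − h(s), a logistic-regression objective that labels expert transitions 1 and policy transitions 0, and a learned reward log D − log(1 − D) = f − log π. The class names, network sizes, and the `(obs, act, next_obs, log_pi)` batch layout are assumptions for illustration; they are not the authors' released code, and the policy step (TRPO) is omitted.

```python
# Minimal sketch of the AIRL discriminator and its update, assuming batches of
# expert and policy transitions plus the policy's log-probabilities are provided.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, hidden=32):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                         nn.Linear(hidden, hidden), nn.Tanh(),
                         nn.Linear(hidden, 1))


class AIRLDiscriminator(nn.Module):
    """f(s, a, s') = g(s, a) + gamma * h(s') - h(s); D = sigmoid(f - log pi)."""

    def __init__(self, obs_dim, act_dim, gamma=0.99, state_only=False):
        super().__init__()
        self.gamma = gamma
        self.state_only = state_only
        g_in = obs_dim if state_only else obs_dim + act_dim
        self.g = mlp(g_in)       # reward approximator g
        self.h = mlp(obs_dim)    # shaping term h

    def f(self, obs, act, next_obs):
        g_in = obs if self.state_only else torch.cat([obs, act], dim=-1)
        return (self.g(g_in) + self.gamma * self.h(next_obs) - self.h(obs)).squeeze(-1)

    def logits(self, obs, act, next_obs, log_pi):
        # D = exp(f) / (exp(f) + pi)  is equivalent to  logit = f - log pi
        return self.f(obs, act, next_obs) - log_pi

    def reward(self, obs, act, next_obs, log_pi):
        # log D - log(1 - D) reduces to f - log pi (entropy-regularized reward)
        with torch.no_grad():
            return self.logits(obs, act, next_obs, log_pi)


def discriminator_step(disc, opt, expert_batch, policy_batch):
    """One logistic-regression update: expert transitions -> 1, policy -> 0."""
    logits_e = disc.logits(*expert_batch)   # each batch: (obs, act, next_obs, log_pi)
    logits_p = disc.logits(*policy_batch)
    loss = (F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e)) +
            F.binary_cross_entropy_with_logits(logits_p, torch.zeros_like(logits_p)))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In the paper's scheme, the policy would then be updated with TRPO on `reward(...)`, and the two steps alternate until convergence; the state-only flag corresponds to the "AIRL State Only" rows in Table 2.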
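For quick reference, the hyperparameters quoted in the Experiment Setup row can be grouped into a single mapping. The layout below is purely illustrative; the paper does not prescribe any configuration format, and the environment keys simply follow the names used in the quoted text.

```python
# Illustrative grouping of the quoted hyperparameters (not the authors' config format).
AIRL_HYPERPARAMS = {
    "entropy_regularizer_weight": {
        "Ant": 0.1, "Swimmer": 0.1, "HalfCheetah": 0.1,
        "PointMass": 1.0,            # point mass environment
    },
    "trpo_batch_size": {             # environment steps per TRPO update
        "Ant": 10000, "Swimmer": 10000, "HalfCheetah": 10000,
        "Pendulum": 2000,
    },
}
```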