Intrinsic Reward Driven Imitation Learning via Generative Model
Authors: Xingrui Yu, Yueming Lyu, Ivor Tsang
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games, even with one-life demonstration. Remarkably, our method achieves performance that is up to 5 times the performance of the demonstration. |
| Researcher Affiliation | Academia | 1Australian Artificial Intelligence Institute, University of Technology Sydney. Correspondence to: Xingrui Yu <Xingrui.Yu@student.uts.edu.au>, Ivor W. Tsang <Ivor.Tsang@uts.edu.au>. |
| Pseudocode | Yes | Algorithm 1 Generative Intrinsic Reward driven Imitation Learning (GIRIL). A hedged, generic sketch of a generative intrinsic-reward module appears after this table. |
| Open Source Code | Yes | The implementation will be available online: https://github.com/xingruiyu/GIRIL |
| Open Datasets | Yes | We evaluate our proposed GIRIL on one-life demonstration data for six Atari games within OpenAI Gym (Brockman et al., 2016). A sketch of collecting a one-life demonstration appears after this table. |
| Dataset Splits | No | The paper uses generated one-life demonstrations for training and evaluates the agent's performance directly in the Atari environment; it does not specify traditional train/validation/test splits for the demonstration data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as a PPO implementation and the Adam optimizer and refers to the OpenAI Gym environment, but it does not give version numbers for these dependencies (e.g., PyTorch or Gym versions). |
| Experiment Setup | Yes | Our first step was to train a reward learning module for each game on the one-life demonstration. ... Training was conducted with the Adam optimizer (Kingma & Ba, 2015) at a learning rate of 3e-5 and a mini-batch size of 32 for 50,000 epochs. ... We set α = 100 for training our reward learning module on Atari games. ... We trained the PPO on the learned reward function for 50 million simulation steps to obtain our final policy. The PPO is trained with a learning rate of 2.5e-4, a clipping threshold of 0.1, an entropy coefficient of 0.01, a value function coefficient of 0.5, and a GAE parameter of 0.95 (Schulman et al., 2016). A configuration sketch summarizing these values appears after this table. |
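Regarding the Pseudocode row: the paper's Algorithm 1 is not reproduced here. As a rough, hedged illustration of the general idea of deriving an intrinsic reward from a generative model, the sketch below uses a simple VAE-style forward model whose prediction error serves as the reward. The module name, architecture, and reward definition are assumptions for illustration, not the authors' exact GIRIL design.

```python
# Hypothetical sketch: a generative forward model whose prediction error is
# used as an intrinsic reward. NOT the paper's exact GIRIL module; the
# architecture and reward definition are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeRewardModule(nn.Module):
    def __init__(self, state_dim, latent_dim=64):
        super().__init__()
        # Encoder: infers a latent code from a (state, next_state) pair.
        self.encoder = nn.Sequential(
            nn.Linear(2 * state_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder: reconstructs the next state from the state and latent code.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, next_state):
        stats = self.encoder(torch.cat([state, next_state], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        pred_next = self.decoder(torch.cat([state, z], dim=-1))
        return pred_next, mu, logvar

    @torch.no_grad()
    def intrinsic_reward(self, state, next_state):
        # Per-sample prediction error used as the reward signal (assumption).
        pred_next, _, _ = self(state, next_state)
        return F.mse_loss(pred_next, next_state, reduction="none").mean(dim=-1)
```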
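Regarding the Open Datasets row: the demonstrations cover a single game life. A minimal sketch of collecting such a one-life trajectory with a classic `gym` Atari environment might look like the following; the environment ID, the random-policy stand-in, and the `ale.lives` bookkeeping are assumptions that may differ across Gym/ALE versions and from the authors' actual data-collection code.

```python
# Hypothetical sketch: collect a single-life demonstration in an Atari game.
# Uses the classic gym API (obs, reward, done, info); the 'ale.lives' key and
# the random-policy fallback are assumptions, not the paper's setup.
import gym

def collect_one_life_demo(env_id="BreakoutNoFrameskip-v4", policy=None):
    env = gym.make(env_id)
    obs = env.reset()
    trajectory = []
    start_lives = None
    done = False
    while not done:
        action = policy(obs) if policy is not None else env.action_space.sample()
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, next_obs, reward))
        lives = info.get("ale.lives")
        if start_lives is None:
            start_lives = lives
        # Stop as soon as the first life is lost (the "one-life" convention).
        if lives is not None and lives < start_lives:
            break
        obs = next_obs
    env.close()
    return trajectory
```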
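Regarding the Experiment Setup row: the quoted hyperparameters can be gathered into a configuration sketch. The numeric values are copied from the excerpt above; the grouping and key names are illustrative only.

```python
# Hyperparameters quoted from the paper's experiment setup; the grouping and
# key names here are illustrative, not from the paper.
reward_module_config = {
    "optimizer": "Adam",        # Kingma & Ba, 2015
    "learning_rate": 3e-5,
    "batch_size": 32,
    "epochs": 50_000,
    "alpha": 100,               # weighting term used for Atari training
}

ppo_config = {
    "total_env_steps": 50_000_000,
    "learning_rate": 2.5e-4,
    "clip_threshold": 0.1,
    "entropy_coef": 0.01,
    "value_loss_coef": 0.5,
    "gae_lambda": 0.95,         # Schulman et al., 2016
}
```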