Intrinsic Reward Driven Imitation Learning via Generative Model

Authors: Xingrui Yu, Yueming Lyu, Ivor Tsang

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games, even with one-life demonstration. Remarkably, our method achieves performance that is up to 5 times the performance of the demonstration.
Researcher Affiliation | Academia | Australian Artificial Intelligence Institute, University of Technology Sydney. Correspondence to: Xingrui Yu <Xingrui.Yu@student.uts.edu.au>, Ivor W. Tsang <Ivor.Tsang@uts.edu.au>.
Pseudocode | Yes | Algorithm 1: Generative Intrinsic Reward driven Imitation Learning (GIRIL)
Open Source Code | Yes | The implementation will be available online: https://github.com/xingruiyu/GIRIL
Open Datasets | Yes | We evaluate our proposed GIRIL on one-life demonstration data for six Atari games within OpenAI Gym (Brockman et al., 2016).
Dataset Splits | No | The paper uses generated one-life demonstrations for training and evaluates the agent's performance directly in the Atari environment. It does not specify traditional train/validation/test splits for the demonstration data itself.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions software components such as a PPO implementation, the Adam optimizer, and the OpenAI Gym environment, but it does not provide specific version numbers for these software dependencies (e.g., PyTorch version, Gym version).
Experiment Setup | Yes | Our first step was to train a reward learning module for each game on the one-life demonstration. ... Training was conducted with the Adam optimizer (Kingma & Ba, 2015) at a learning rate of 3e-5 and a mini-batch size of 32 for 50,000 epochs. ... We set α = 100 for training our reward learning module on Atari games. ... We trained the PPO on the learned reward function for 50 million simulation steps to obtain our final policy. The PPO is trained with a learning rate of 2.5e-4, a clipping threshold of 0.1, an entropy coefficient of 0.01, a value function coefficient of 0.5, and a GAE parameter of 0.95 (Schulman et al., 2016).
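
For convenience, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration. The sketch below is an illustrative summary only: the class and field names are ours, not the authors', and only the numeric values come from the paper.

```python
# Illustrative configuration collecting the hyperparameters quoted above.
# Class and field names are placeholders; only the values are from the paper.
from dataclasses import dataclass


@dataclass
class RewardModuleConfig:
    """Stage 1: reward learning module trained on the one-life demonstration."""
    learning_rate: float = 3e-5   # Adam (Kingma & Ba, 2015)
    batch_size: int = 32
    epochs: int = 50_000
    alpha: float = 100.0          # reported as α = 100 for Atari games


@dataclass
class PPOTrainingConfig:
    """Stage 2: PPO trained on the learned reward function."""
    total_simulation_steps: int = 50_000_000
    learning_rate: float = 2.5e-4
    clip_threshold: float = 0.1
    entropy_coef: float = 0.01
    value_function_coef: float = 0.5
    gae_lambda: float = 0.95      # GAE parameter (Schulman et al., 2016)
```

These values map directly onto the corresponding arguments of most public PPO implementations; the paper does not state which PPO implementation or framework versions were used (see the Software Dependencies row).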
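
The Pseudocode row refers to Algorithm 1 (GIRIL), which, per the Experiment Setup quote, first trains a reward learning module on the one-life demonstration and then trains PPO on the learned reward. The sketch below illustrates that two-stage pattern with a deliberately simplified stand-in for the reward module: a plain forward model whose prediction error serves as the intrinsic reward. This is an assumption for illustration, not the authors' generative model; consult https://github.com/xingruiyu/GIRIL for the actual implementation.

```python
# Simplified stand-in for the reward learning module in Algorithm 1 (GIRIL).
# ASSUMPTION: the module is modeled here as a forward dynamics network and the
# intrinsic reward is its prediction error; the authors' generative model and
# reward definition may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardLearningModule(nn.Module):
    """Placeholder dynamics model over flattened observations."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 256):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        onehot = F.one_hot(actions, self.n_actions).float()
        return self.net(torch.cat([obs, onehot], dim=-1))


def fit_on_demonstration(module, obs, actions, next_obs,
                         epochs=50_000, lr=3e-5, batch_size=32):
    """Stage 1: fit the module on one-life demonstration transitions."""
    optimizer = torch.optim.Adam(module.parameters(), lr=lr)
    n = obs.shape[0]
    for _ in range(epochs):
        idx = torch.randint(0, n, (batch_size,))
        pred = module(obs[idx], actions[idx])
        loss = F.mse_loss(pred, next_obs[idx])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return module


def intrinsic_reward(module, obs, actions, next_obs):
    """Stage 2 reward signal: per-transition prediction error (illustrative)."""
    with torch.no_grad():
        pred = module(obs, actions)
    return ((pred - next_obs) ** 2).mean(dim=-1)
```

A PPO learner would then be trained for 50 million simulation steps against `intrinsic_reward` in place of the environment reward, using the PPO hyperparameters listed in the configuration sketch above.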