Intrinsic Reward Driven Imitation Learning via Generative Model
Authors: Xingrui Yu, Yueming Lyu, Ivor Tsang
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that our method outperforms state-of-the-art IRL methods on multiple Atari games, even with one-life demonstration. Remarkably, our method achieves performance that is up to 5 times the performance of the demonstration. |
| Researcher Affiliation | Academia | 1Australian Artificial Intelligence Institute, University of Technology Sydney. Correspondence to: Xingrui Yu <Xingrui.Yu@student.uts.edu.au>, Ivor W. Tsang <Ivor.Tsang@uts.edu.au>. |
| Pseudocode | Yes | Algorithm 1 Generative Intrinsic Reward driven Imitation Learning (GIRIL). A hedged, generic sketch of a generative intrinsic-reward module appears after this table. |
| Open Source Code | Yes | The implementation will be available online: https://github.com/xingruiyu/GIRIL |
| Open Datasets | Yes | We evaluate our proposed GIRIL on one-life demonstration data for six Atari games within OpenAI Gym (Brockman et al., 2016). A sketch of collecting a one-life demonstration appears after this table. |
| Dataset Splits | No | The paper uses generated one-life demonstrations for training and evaluates the agent's performance directly in the Atari environment; it does not specify traditional train/validation/test splits for the demonstration data. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as a PPO implementation and the Adam optimizer and refers to the OpenAI Gym environment, but it does not give version numbers for these dependencies (e.g., PyTorch or Gym versions). |
| Experiment Setup | Yes | Our first step was to train a reward learning module for each game on the one-life demonstration. ... Training was conducted with the Adam optimizer (Kingma & Ba, 2015) at a learning rate of 3e-5 and a mini-batch size of 32 for 50,000 epochs. ... We set α = 100 for training our reward learning module on Atari games. ... We trained the PPO on the learned reward function for 50 million simulation steps to obtain our final policy. The PPO is trained with a learning rate of 2.5e-4, a clipping threshold of 0.1, an entropy coefficient of 0.01, a value function coefficient of 0.5, and a GAE parameter of 0.95 (Schulman et al., 2016). A configuration sketch summarizing these values appears after this table. |
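Regarding the Pseudocode row: the paper's Algorithm 1 is not reproduced here. As a rough, hedged illustration of the general idea of deriving an intrinsic reward from a generative model, the sketch below uses a simple VAE-style forward model whose prediction error serves as the reward. The module name, architecture, and reward definition are assumptions for illustration, not the authors' exact GIRIL design.

```python
# Hypothetical sketch: a generative forward model whose prediction error is
# used as an intrinsic reward. NOT the paper's exact GIRIL module; the
# architecture and reward definition are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeRewardModule(nn.Module):
    def __init__(self, state_dim, latent_dim=64):
        super().__init__()
        # Encoder: infers a latent code from a (state, next_state) pair.
        self.encoder = nn.Sequential(
            nn.Linear(2 * state_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder: reconstructs the next state from the state and latent code.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + latent_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, next_state):
        stats = self.encoder(torch.cat([state, next_state], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        pred_next = self.decoder(torch.cat([state, z], dim=-1))
        return pred_next, mu, logvar

    @torch.no_grad()
    def intrinsic_reward(self, state, next_state):
        # Per-sample prediction error used as the reward signal (assumption).
        pred_next, _, _ = self(state, next_state)
        return F.mse_loss(pred_next, next_state, reduction="none").mean(dim=-1)
```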
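Regarding the Open Datasets row: the demonstrations cover a single game life. A minimal sketch of collecting such a one-life trajectory with a classic `gym` Atari environment might look like the following; the environment ID, the random-policy stand-in, and the `ale.lives` bookkeeping are assumptions that may differ across Gym/ALE versions and from the authors' actual data-collection code.

```python
# Hypothetical sketch: collect a single-life demonstration in an Atari game.
# Uses the classic gym API (obs, reward, done, info); the 'ale.lives' key and
# the random-policy fallback are assumptions, not the paper's setup.
import gym

def collect_one_life_demo(env_id="BreakoutNoFrameskip-v4", policy=None):
    env = gym.make(env_id)
    obs = env.reset()
    trajectory = []
    start_lives = None
    done = False
    while not done:
        action = policy(obs) if policy is not None else env.action_space.sample()
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, next_obs, reward))
        lives = info.get("ale.lives")
        if start_lives is None:
            start_lives = lives
        # Stop as soon as the first life is lost (the "one-life" convention).
        if lives is not None and lives < start_lives:
            break
        obs = next_obs
    env.close()
    return trajectory
```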
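Regarding the Experiment Setup row: the quoted hyperparameters can be gathered into a configuration sketch. The numeric values are copied from the excerpt above; the grouping and key names are illustrative only.

```python
# Hyperparameters quoted from the paper's experiment setup; the grouping and
# key names here are illustrative, not from the paper.
reward_module_config = {
    "optimizer": "Adam",        # Kingma & Ba, 2015
    "learning_rate": 3e-5,
    "batch_size": 32,
    "epochs": 50_000,
    "alpha": 100,               # weighting term used for Atari training
}

ppo_config = {
    "total_env_steps": 50_000_000,
    "learning_rate": 2.5e-4,
    "clip_threshold": 0.1,
    "entropy_coef": 0.01,
    "value_loss_coef": 0.5,
    "gae_lambda": 0.95,         # Schulman et al., 2016
}
```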