Self-Imitation Learning
Authors: Junhyuk Oh, Yijie Guo, Satinder Singh, Honglak Lee
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results show that SIL significantly improves advantage actor-critic (A2C) on several hard exploration Atari games and is competitive to the state-of-the-art count-based exploration methods. We also show that SIL improves proximal policy optimization (PPO) on MuJoCo tasks. |
| Researcher Affiliation | Collaboration | ¹University of Michigan, ²Google Brain. Correspondence to: Junhyuk Oh <junhyuk@umich.edu>, Yijie Guo <guoyijie@umich.edu>. |
| Pseudocode | Yes | Algorithm 1 Actor-Critic with Self-Imitation Learning |
| Open Source Code | Yes | The code is available on https://github.com/junhyukoh/self-imitation-learning. |
| Open Datasets | Yes | For Atari experiments, we used a 3-layer convolutional neural network used in DQN (Mnih et al., 2015) with last 4 stacked frames as input. ... on several hard exploration Atari games (Bellemare et al., 2013). ... Finally, SIL improves the performance of proximal policy optimization (PPO) on MuJoCo continuous control tasks (Brockman et al., 2016; Todorov et al., 2012). |
| Dataset Splits | No | The paper does not explicitly provide specific percentages, sample counts, or citations to predefined train/validation/test splits for the datasets used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as CPU or GPU models, memory, or cloud instance types. |
| Software Dependencies | No | The paper mentions that their implementation is "based on OpenAI's baseline implementation (Dhariwal et al., 2017)" but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | For Atari experiments, we used a 3-layer convolutional neural network used in DQN (Mnih et al., 2015) with last 4 stacked frames as input. We performed 4 self-imitation learning updates per on-policy actor-critic update (M = 4 in Algorithm 1). ... For MuJoCo experiments, we used an MLP which consists of 2 hidden layers with 64 units as in Schulman et al. (2017b). We performed 10 self-imitation learning updates per each iteration (batch). |
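
The Pseudocode row above points to Algorithm 1 (actor-critic with self-imitation learning). As a rough illustration of the self-imitation update that the algorithm interleaves with the on-policy step, the sketch below implements the SIL loss from the paper, -log π_θ(a|s)·(R - V_θ(s))₊ + (β^sil/2)·(R - V_θ(s))₊², on a batch sampled from a replay buffer of past transitions and their returns. This is a minimal sketch assuming PyTorch; `model`, `optimizer`, and `replay_batch` are illustrative names, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def sil_update(model, optimizer, replay_batch, beta_sil=0.01):
    """One self-imitation learning step on a batch of past (state, action, return) tuples."""
    states, actions, returns = replay_batch          # R: discounted return observed in a past episode
    logits = model.pi_logits(states)                 # unnormalized pi_theta(.|s), shape [B, num_actions]
    values = model.value(states).squeeze(-1)         # V_theta(s), shape [B]

    # (R - V)_+ : only imitate past actions whose return exceeded the current value estimate
    clipped_adv = torch.clamp(returns - values, min=0.0)

    log_prob = -F.cross_entropy(logits, actions, reduction="none")   # log pi_theta(a|s)
    policy_loss = -(log_prob * clipped_adv.detach()).mean()
    value_loss = 0.5 * (clipped_adv ** 2).mean()

    loss = policy_loss + beta_sil * value_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Per the quoted setup, such an update would be called M = 4 times after each on-policy A2C update on Atari, and 10 times per iteration when combined with PPO on MuJoCo.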
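The Experiment Setup row describes the network shapes only in prose. Below is a minimal sketch of those torsos, again assuming PyTorch and standard 84x84 Atari preprocessing; the layer sizes follow the DQN architecture of Mnih et al. (2015) and the 2x64-unit MLP of Schulman et al. (2017b) cited above, and the names (`atari_torso`, `mujoco_torso`, `obs_dim`) are illustrative, not taken from the released code.

```python
import torch.nn as nn

# DQN-style 3-layer CNN over the last 4 stacked 84x84 frames (Atari);
# policy and value heads would attach to the 512-unit output.
atari_torso = nn.Sequential(
    nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
)

# 2-hidden-layer MLP with 64 units each (MuJoCo); obs_dim is environment-specific,
# e.g. 11 for Hopper-v2 (an assumed example, not stated in the paper).
obs_dim = 11
mujoco_torso = nn.Sequential(
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
)
```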