Exploration by Random Network Distillation
Authors: Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level. |
| Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. |
| Pseudocode | Yes | Algorithm 1: RND pseudo-code (a minimal illustrative sketch of the RND bonus follows this table) |
| Open Source Code | Yes | Exact details of the method can be found in the code accompanying this paper (goo.gl/DGPC8E). |
| Open Datasets | No | Atari games have been a standard benchmark for deep reinforcement learning algorithms since the pioneering work by Mnih et al. (2013). Bellemare et al. (2016) identified among these games the hard exploration games with sparse rewards: Freeway, Gravitar, Montezuma's Revenge, Pitfall!, Private Eye, Solaris, and Venture. The paper uses these well-known benchmarks but does not provide specific links or access information to the dataset files themselves. |
| Dataset Splits | No | The paper describes experiments in a reinforcement learning setting where data is generated through interaction with the environment (Atari games). It does not specify fixed training, validation, and test dataset splits as would be typical for supervised learning tasks. While it mentions 'training data' and 'test examples' in the context of a toy MNIST example, this is not applicable to the main Atari experiments. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, memory, or cloud instances) used to run the experiments. |
| Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the policy optimization algorithm and Adam (Kingma & Ba (2015)) for optimization. However, it does not specify software dependencies with version numbers (e.g., specific deep learning frameworks like TensorFlow or PyTorch, or their versions). |
| Experiment Setup | Yes | For details of hyperparameters and architectures we refer the reader to Appendices A.3 and A.4. Most experiments are run for 30K rollouts of length 128 per environment with 128 parallel environments, for a total of 1.97 billion frames of experience. Table 4: Default hyperparameters for PPO and RND algorithms for experiments where applicable. |
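
The Research Type and Pseudocode rows above quote the paper's description of the RND bonus and its Algorithm 1 without reproducing them. For orientation, the following is a minimal, illustrative sketch of the bonus as described in the paper, not the authors' accompanying implementation (goo.gl/DGPC8E): it assumes PyTorch, uses small fully connected networks in place of the paper's convolutional architecture, and omits the observation and reward normalization the paper applies. The class name `RNDBonus` and all sizes, batch shapes, and learning rates are placeholders.

```python
# Minimal sketch of an RND-style exploration bonus (not the authors' code).
# Assumptions: PyTorch, small MLPs instead of the paper's convolutional
# networks, and no observation/reward normalization (the paper uses both).
import torch
import torch.nn as nn


class RNDBonus(nn.Module):
    """Fixed random target network plus a trained predictor; the
    per-observation prediction error is used as the intrinsic reward."""

    def __init__(self, obs_dim: int, feature_dim: int = 128):
        super().__init__()
        # Randomly initialized target network; never trained.
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim)
        )
        # Predictor network, trained to match the target's output.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim)
        )
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Squared prediction error per observation: high for novel states,
        # low for states the predictor has already been trained on.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)


# Usage sketch: the error serves both as the intrinsic reward handed to the
# RL algorithm (PPO in the paper) and as the distillation loss for the
# predictor, computed on observations gathered during rollouts.
rnd = RNDBonus(obs_dim=84 * 84)  # e.g. a flattened, preprocessed Atari frame
optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)

obs_batch = torch.randn(32, 84 * 84)   # stand-in for rollout observations
error = rnd(obs_batch)
intrinsic_reward = error.detach()      # exploration bonus per observation
loss = error.mean()                    # predictor training loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the target network stays frozen, the prediction error shrinks only for observations the predictor has been trained on, which is what lets the error act as a novelty bonus; the paper's full pipeline additionally normalizes observations and intrinsic rewards and combines the bonus with the extrinsic reward under PPO.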