Exploration by Random Network Distillation
Authors: Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level. |
| Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. |
| Pseudocode | Yes | Algorithm 1: RND pseudo-code (a minimal illustrative sketch of the RND bonus follows this table) |
| Open Source Code | Yes | Exact details of the method can be found in the code accompanying this paper (goo.gl/DGPC8E). |
| Open Datasets | No | Atari games have been a standard benchmark for deep reinforcement learning algorithms since the pioneering work by Mnih et al. (2013). Bellemare et al. (2016) identified among these games the hard exploration games with sparse rewards: Freeway, Gravitar, Montezuma's Revenge, Pitfall!, Private Eye, Solaris, and Venture. The paper uses these well-known benchmarks but does not provide specific links or access information to the dataset files themselves. |
| Dataset Splits | No | The paper describes experiments in a reinforcement learning setting where data is generated through interaction with the environment (Atari games). It does not specify fixed training, validation, and test dataset splits as would be typical for supervised learning tasks. While it mentions 'training data' and 'test examples' in the context of a toy MNIST example, this is not applicable to the main Atari experiments. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU, GPU models, memory, or cloud instances) used to run the experiments. |
| Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the policy optimization algorithm and Adam (Kingma & Ba (2015)) for optimization. However, it does not specify software dependencies with version numbers (e.g., specific deep learning frameworks like TensorFlow or PyTorch, or their versions). |
| Experiment Setup | Yes | For details of hyperparameters and architectures we refer the reader to Appendices A.3 and A.4. Most experiments are run for 30K rollouts of length 128 per environment with 128 parallel environments, for a total of 1.97 billion frames of experience. Table 4: Default hyperparameters for PPO and RND algorithms for experiments where applicable. |
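
The Research Type and Pseudocode rows above quote the paper's description of the RND bonus and its Algorithm 1 without reproducing them. For orientation, the following is a minimal, illustrative sketch of the bonus as described in the paper, not the authors' accompanying implementation (goo.gl/DGPC8E): it assumes PyTorch, uses small fully connected networks in place of the paper's convolutional architecture, and omits the observation and reward normalization the paper applies. The class name `RNDBonus` and all sizes, batch shapes, and learning rates are placeholders.

```python
# Minimal sketch of an RND-style exploration bonus (not the authors' code).
# Assumptions: PyTorch, small MLPs instead of the paper's convolutional
# networks, and no observation/reward normalization (the paper uses both).
import torch
import torch.nn as nn


class RNDBonus(nn.Module):
    """Fixed random target network plus a trained predictor; the
    per-observation prediction error is used as the intrinsic reward."""

    def __init__(self, obs_dim: int, feature_dim: int = 128):
        super().__init__()
        # Randomly initialized target network; never trained.
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim)
        )
        # Predictor network, trained to match the target's output.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feature_dim)
        )
        for p in self.target.parameters():
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Squared prediction error per observation: high for novel states,
        # low for states the predictor has already been trained on.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return (pred_feat - target_feat).pow(2).mean(dim=-1)


# Usage sketch: the error serves both as the intrinsic reward handed to the
# RL algorithm (PPO in the paper) and as the distillation loss for the
# predictor, computed on observations gathered during rollouts.
rnd = RNDBonus(obs_dim=84 * 84)  # e.g. a flattened, preprocessed Atari frame
optimizer = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)

obs_batch = torch.randn(32, 84 * 84)   # stand-in for rollout observations
error = rnd(obs_batch)
intrinsic_reward = error.detach()      # exploration bonus per observation
loss = error.mean()                    # predictor training loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the target network stays frozen, the prediction error shrinks only for observations the predictor has been trained on, which is what lets the error act as a novelty bonus; the paper's full pipeline additionally normalizes observations and intrinsic rewards and combines the bonus with the extrinsic reward under PPO.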