Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Exploration by Random Network Distillation

Authors: Yuri Burda, Harrison Edwards, Amos Storkey, Oleg Klimov

ICLR 2019 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that the random network distillation (RND) bonus combined with this increased flexibility enables significant progress on several hard exploration Atari games. In particular we establish state of the art performance on Montezuma's Revenge, a game famously difficult for deep reinforcement learning methods. To the best of our knowledge, this is the first method that achieves better than average human performance on this game without using demonstrations or having access to the underlying state of the game, and occasionally completes the first level.
Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review.
Pseudocode | Yes | Algorithm 1: RND pseudo-code (a hedged sketch of the bonus computation follows this table).
Open Source Code | Yes | Exact details of the method can be found in the code accompanying this paper (goo.gl/DGPC8E).
Open Datasets | No | Atari games have been a standard benchmark for deep reinforcement learning algorithms since the pioneering work by Mnih et al. (2013). Bellemare et al. (2016) identified among these games the hard exploration games with sparse rewards: Freeway, Gravitar, Montezuma's Revenge, Pitfall!, Private Eye, Solaris, and Venture. The paper uses these well-known benchmarks but does not provide specific links or access information to the dataset files themselves.
Dataset Splits | No | The paper describes experiments in a reinforcement learning setting where data is generated through interaction with the environment (Atari games). It does not specify fixed training, validation, and test dataset splits as would be typical for supervised learning tasks. While it mentions 'training data' and 'test examples' in the context of a toy MNIST example, this is not applicable to the main Atari experiments.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU or GPU models, memory, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper mentions using PPO (Schulman et al., 2017) as the policy optimization algorithm and Adam (Kingma & Ba, 2015) for optimization. However, it does not specify software dependencies with version numbers (e.g., specific deep learning frameworks such as TensorFlow or PyTorch, or their versions).
Experiment Setup | Yes | For details of hyperparameters and architectures we refer the reader to Appendices A.3 and A.4. Most experiments are run for 30K rollouts of length 128 per environment with 128 parallel environments, for a total of 1.97 billion frames of experience. Table 4: Default hyperparameters for PPO and RND algorithms for experiments where applicable.
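
For readers unfamiliar with the method being assessed, the following is a minimal, hypothetical sketch of the RND exploration bonus as described in the paper: a fixed, randomly initialized target network embeds each observation, a trained predictor network tries to match that embedding, and the prediction error is used as the intrinsic reward. The network shapes, names, and training step below are illustrative assumptions, not the authors' exact architecture; their actual setup is given in Appendices A.3 and A.4 and the released code.

    import torch
    import torch.nn as nn

    def make_embedding_net(obs_dim: int = 84 * 84, feat_dim: int = 512) -> nn.Module:
        # Small MLP stand-in for the convolutional networks used on Atari frames (assumption).
        return nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    target = make_embedding_net()      # fixed random network, never trained
    predictor = make_embedding_net()   # trained to predict the target's output
    for p in target.parameters():
        p.requires_grad_(False)

    optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)

    def intrinsic_reward_and_update(obs: torch.Tensor) -> torch.Tensor:
        """obs: batch of (normalized) observations, shape (batch, obs_dim)."""
        with torch.no_grad():
            target_feat = target(obs)
        pred_feat = predictor(obs)
        # Per-example prediction error serves as the exploration bonus.
        error = (pred_feat - target_feat).pow(2).mean(dim=1)
        # Train the predictor on the same batch, so frequently visited states
        # become predictable and earn smaller bonuses over time.
        loss = error.mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return error.detach()

In the paper this bonus is combined with the extrinsic game reward and optimized with PPO; the sketch above omits the observation and intrinsic-reward normalization that the authors report as important in practice.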