Never Give Up: Learning Directed Exploration Strategies

Authors: Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martin Arjovsky, Alexander Pritzel, Andrew Bolt, Charles Blundell

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. Our method doubles the performance of the base agent in all hard exploration games in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features. (The normalisation behind this score is sketched below the table.)
Researcher Affiliation | Industry | DeepMind {adriap, psprechmann, avlife, danielguo, piot, skapturowski, tieleman, apritzel, abolt, cblundell}@google.com
Pseudocode | Yes | Algorithm 1: Computation of the episodic intrinsic reward at time t, r_t^episodic. (A minimal sketch of this computation is given below the table.)
Open Source Code | No | The paper provides links to videos demonstrating the agent's behavior (e.g., 'See video of the trained agent here: https://youtu.be/9HTY4ruPrHw') but does not state that the source code for the method is openly available, nor does it provide a link to a code repository.
Open Datasets | Yes | We use standard Atari evaluation protocol and pre-processing as described in Tab. 8 of App. F.4, with the only difference being that we do not use frame stacking. We restrict NGU to using the same setting and data consumption as R2D2, the best performing algorithm on Atari (Kapturowski et al., 2019).
Dataset Splits | No | The paper refers to standard Atari evaluation protocols but does not explicitly define training, validation, and test splits with specific percentages or sample counts for reproduction.
Hardware Specification | No | The paper mentions using a 'single GPU-based learner' and '256 actors running in parallel', but does not specify the GPU or CPU models or any other detailed hardware specification.
Software Dependencies | No | The paper names several software components and frameworks (e.g., the 'Adam Optimizer', 'Recurrent Replay Distributed DQN (Kapturowski et al., 2019, R2D2)', and the 'pycolab game engine (Stepleton, 2017)'), but does not provide version numbers for these dependencies, which would be needed for an exactly reproducible setup.
Experiment Setup | Yes | We use standard Atari evaluation protocol and pre-processing as described in Tab. 8 of App. F.4. A full list of hyperparameters for both the common settings and Disco Maze is given in Appendix F, including specific values for learning rates, batch sizes, discount factors, and other architectural and training parameters (e.g., 'Table 6: Common hyperparameters'). (An illustrative configuration skeleton is sketched below.)
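
The 1344.0% figure quoted in the Research Type row is the median human normalised score, the standard Atari metric that rescales each game's score so that random play maps to 0 and the human baseline to 1, then takes the median across the 57 games. A minimal sketch of that normalisation, with made-up per-game inputs:

```python
import numpy as np

def human_normalised_score(agent: float, random: float, human: float) -> float:
    """Standard Atari normalisation: 0.0 is random play, 1.0 is the human baseline."""
    return (agent - random) / (human - random)

# Made-up per-game (agent, random, human) scores, purely for illustration.
per_game = [(9000.0, 100.0, 800.0), (300.0, 50.0, 400.0), (12000.0, 0.0, 1000.0)]
median_hns = np.median([human_normalised_score(a, r, h) for a, r, h in per_game])
print(f"median human normalised score: {100 * median_hns:.1f}%")
# The paper's reported 1344.0% corresponds to a median normalised value of 13.44.
```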
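
The Pseudocode row points to Algorithm 1, which computes the episodic intrinsic reward from a k-nearest-neighbour lookup in an episodic memory of controllable-state embeddings. The Python sketch below follows the computation as the paper describes it (inverse-kernel pseudo-counts over the k nearest neighbours, with a similarity cut-off); the constant values and the running-mean update rule are illustrative placeholders, not the exact settings from the paper's appendix.

```python
import numpy as np

# Illustrative constants; the paper's appendix gives the exact values.
K_NEIGHBOURS = 10      # number of nearest neighbours used in the lookup
EPS = 1e-3             # kernel epsilon
C = 1e-3               # small constant added to the similarity
XI = 8e-3              # "cluster" distance removed after normalisation
MAX_SIMILARITY = 8.0   # similarity threshold above which the reward is zeroed

def episodic_intrinsic_reward(embedding, memory, sq_dist_running_mean):
    """Sketch of Algorithm 1: episodic reward r_t^episodic from k-NN in memory.

    embedding:            controllable-state embedding f(x_t), shape [d]
    memory:               past embeddings for the current episode, shape [n, d]
    sq_dist_running_mean: running average of squared k-NN distances, kept by
                          the caller across the episode (a scalar)
    """
    if len(memory) == 0:
        return 0.0, sq_dist_running_mean

    # Squared Euclidean distances to everything stored this episode.
    sq_dists = np.sum((memory - embedding) ** 2, axis=-1)
    k = min(K_NEIGHBOURS, len(memory))
    nn_sq_dists = np.sort(sq_dists)[:k]

    # Update the running mean of squared neighbour distances
    # (a simple exponential average here, chosen for the sketch).
    sq_dist_running_mean = 0.99 * sq_dist_running_mean + 0.01 * nn_sq_dists.mean()

    # Normalise by the running mean and remove a small cluster radius.
    normalised = nn_sq_dists / max(sq_dist_running_mean, 1e-8)
    normalised = np.maximum(normalised - XI, 0.0)

    # Inverse-kernel pseudo-counts and the resulting similarity score.
    kernel = EPS / (normalised + EPS)
    similarity = np.sqrt(kernel.sum()) + C

    reward = 0.0 if similarity > MAX_SIMILARITY else 1.0 / similarity
    return reward, sq_dist_running_mean
```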
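
Finally, for the Open Datasets and Experiment Setup rows: reproducing the setup amounts to transcribing the pre-processing of Tab. 8 and the hyperparameters of Table 6 (App. F) into a configuration. The skeleton below only illustrates the kind of fields involved; field names are hypothetical, values marked None must be taken from the paper's tables, and the defaults shown are common Atari/R2D2-style choices rather than confirmed values.

```python
# Hypothetical configuration skeleton for a reproduction attempt.
ngu_config = {
    # Atari pre-processing (paper: standard protocol, except no frame stacking)
    "grayscale": True,            # common default, not confirmed against Tab. 8
    "observation_size": (84, 84), # common default, not confirmed against Tab. 8
    "action_repeat": 4,           # common default, not confirmed against Tab. 8
    "frame_stacking": False,      # explicitly disabled in the paper
    # Distributed setup described in the text
    "num_actors": 256,            # "256 actors running in parallel"
    "num_learners": 1,            # "single GPU-based learner"
    # Optimisation and agent hyperparameters (exact values in Table 6)
    "optimizer": "adam",
    "learning_rate": None,
    "batch_size": None,
    "discount_factors": None,     # NGU uses a family of policies with different discounts
    "num_mixtures": None,         # number of directed exploratory policies
    "replay_settings": None,      # R2D2-style recurrent replay parameters
}
```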