Never Give Up: Learning Directed Exploration Strategies
Authors: Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martin Arjovsky, Alexander Pritzel, Andrew Bolt, Charles Blundell
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. Our method doubles the performance of the base agent in all hard exploration in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features. (The human normalised score convention is sketched after the table.) |
| Researcher Affiliation | Industry | DeepMind {adriap, psprechmann, avlife, danielguo, piot, skapturowski, tieleman, apritzel, abolt, cblundell}@google.com |
| Pseudocode | Yes | Algorithm 1: Computation of the episodic intrinsic reward at time t, r_t^episodic (a hedged sketch of this computation is given after the table). |
| Open Source Code | No | The paper provides links to videos demonstrating the agent's behavior (e.g., 'See video of the trained agent here: https://youtu.be/9HTY4ruPrHw') but does not state that the source code for the methodology is openly available or provide a link to a code repository. |
| Open Datasets | Yes | We use standard Atari evaluation protocol and pre-processing as described in Tab. 8 of App. F.4, with the only difference being that we do not use frame stacking. We restrict NGU to using the same setting and data consumption as R2D2, the best performing algorithm on Atari (Kapturowski et al., 2019). |
| Dataset Splits | No | The paper refers to standard Atari evaluation protocols but does not explicitly define training, validation, and test splits with specific percentages or sample counts for reproduction. |
| Hardware Specification | No | The paper mentions using a 'single GPU-based learner' and '256 actors running in parallel', but does not specify the models of the GPUs or CPUs, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions several software components and frameworks by name (e.g., 'Adam Optimizer', 'Recurrent Replay Distributed DQN (Kapturowski et al., 2019, R2D2)', 'pycolab game engine (Stepleton, 2017)'), but does not provide specific version numbers for these software dependencies, which are necessary for reproducible descriptions. |
| Experiment Setup | Yes | We use standard Atari evaluation protocol and pre-processing as described in Tab. 8 of App. F.4. We also have a full list of hyperparameters for both common settings and Disco Maze in Appendix F, including specific values for learning rates, batch sizes, discount factors, and other architectural and training parameters, for example, 'Table 6: Common hyperparameters.'. |
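
For reference, the "median human normalised score of 1344.0%" quoted in the Research Type row follows the standard Atari-57 convention: each game's raw score is rescaled so that a random policy scores 0% and the human reference scores 100%, and the median is taken over the 57 games. A minimal sketch of that convention (the per-game random and human baseline scores themselves are not reproduced here, and the function names are ours):

```python
import numpy as np

def human_normalised_score(agent_score, random_score, human_score):
    """Standard Atari normalisation: 0% = random policy, 100% = human baseline."""
    return 100.0 * (agent_score - random_score) / (human_score - random_score)

def median_hns(per_game):
    """Median over the suite; `per_game` maps game name -> (agent, random, human) raw scores."""
    return float(np.median(
        [human_normalised_score(a, r, h) for a, r, h in per_game.values()]
    ))
```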
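The Pseudocode row refers to the paper's Algorithm 1, which computes the episodic intrinsic reward r_t^episodic from the k-nearest neighbours of the current controllable-state embedding in an episodic memory, using an inverse kernel over normalised squared distances. The following is a minimal NumPy sketch based on our reading of that algorithm, not the authors' implementation; the constants (k = 10, epsilon = c = 0.001, cluster distance 0.008, maximal similarity 8) are the appendix values as we understand them, and the running-average update is a simplification of the agent's actual estimator.

```python
import numpy as np

def episodic_intrinsic_reward(memory, embedding, dist_running_avg,
                              k=10, epsilon=1e-3, c=1e-3,
                              cluster_distance=8e-3, max_similarity=8.0):
    """Sketch of r_t^episodic for the current embedding f(x_t).

    memory:            (N, d) array of embeddings stored so far in this episode.
    embedding:         (d,) embedding f(x_t) from the embedding network.
    dist_running_avg:  running average of squared k-NN distances (simplified update below).
    Returns (reward, updated running average).
    """
    if len(memory) == 0:
        return 1.0, dist_running_avg  # no neighbours yet

    # Squared Euclidean distances to the k nearest neighbours in episodic memory.
    sq_dists = np.sum((memory - embedding) ** 2, axis=1)
    knn_sq_dists = np.sort(sq_dists)[:k]

    # Update the running average of squared k-NN distances (simplified exponential average).
    dist_running_avg = 0.99 * dist_running_avg + 0.01 * knn_sq_dists.mean()

    # Normalise distances and ignore those within the cluster distance.
    normed = knn_sq_dists / max(dist_running_avg, 1e-8)
    normed = np.maximum(normed - cluster_distance, 0.0)

    # Inverse kernel K = eps / (d_n + eps), then overall similarity.
    kernel = epsilon / (normed + epsilon)
    similarity = np.sqrt(kernel.sum()) + c

    # Highly similar (already visited) states yield no reward.
    reward = 0.0 if similarity > max_similarity else 1.0 / similarity
    return reward, dist_running_avg
```

In the full agent this episodic reward is further modulated by a lifelong novelty signal (a Random Network Distillation multiplier) before being mixed with the extrinsic reward; that part is omitted from the sketch above.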