Human-level Atari 200x faster

Authors: Steven Kapturowski, Víctor Campos, Ray Jiang, Nemanja Rakicevic, Hado van Hasselt, Charles Blundell, Adria Puigdomenech Badia

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a novel agent that we call MEME, for MEME is an Efficient Memory-based Exploration agent, which introduces solutions to enable taking advantage of three approaches that would otherwise lead to instabilities... Our agent outperforms the human baseline across all 57 Atari games in 390M frames, using two orders of magnitude fewer interactions with the environment than Agent57, as shown in Figure 1. We also demonstrate competitive performance with high-performing methods such as Muesli and MuZero. We analyze the contribution of all the components introduced in Section 4 through ablation experiments on the same subset of eight games.
Researcher Affiliation | Industry | DeepMind, *Equal contribution, {skapturowski,camunez,rayjiang,rakicevic,hado,cblundell,adriap}@deepmind.com
Pseudocode | Yes | Algorithm 1: Computation of the episodic intrinsic reward at time t, r_t^episodic. (A hedged sketch of this computation is given after the table.)
Open Source Code | No | The paper states: 'In this manuscript we made additional efforts to make sure that the explanations of the proposed methods are detailed enough to be easily reproduced by the community.' but does not explicitly state that the source code for their method is open-sourced or provide a link to a repository.
Open Datasets | Yes | The Arcade Learning Environment (ALE) (Bellemare et al., 2013) was introduced as a benchmark to evaluate agents on a diverse set of tasks which are interesting to humans, and developed externally to the Reinforcement Learning (RL) community.
Dataset Splits | No | Hyperparameters have been tuned over a subset of eight games, encompassing games with different reward density and scale, and requiring credit assignment over different time horizons: Frostbite, H.E.R.O., Montezuma's Revenge, Pitfall!, Skiing, Solaris, Surround, and Tennis. However, specific numerical splits (e.g., 80/10/10) for the overall dataset (frames) are not provided.
Hardware Specification | Yes | For the experiments we used the TPUv4, with the 2x2x1 topology used for the learner. Acting is accelerated by sending observations from actors to a shared server that runs batched inference using a 1x1x1 TPUv4, which is used for inference within the actor and evaluation workers.
Software Dependencies | No | The paper mentions 'Reverb (Cassirer et al., 2021)' and 'AdamW with Nesterov Momentum' but does not specify version numbers for any software dependencies or libraries.
Experiment Setup | Yes | An exhaustive description of the hyperparameters used is provided in Appendix A, and the network architecture in Appendix B. Table 2: Agent Hyper-parameters lists specific values such as 'Adam Learning Rate 3e-4' and 'Batch Size 64'. (A hedged config sketch collecting these values follows the table.)
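
The episodic intrinsic reward named in the Pseudocode row follows the Never Give Up / Agent57 family of methods that MEME builds on. Below is a minimal NumPy sketch of that computation; the function name, the constants (eps, cluster_distance, c, max_similarity), the empty-memory case, and the use of a per-step mean instead of a running mean of k-NN distances are assumptions for illustration, not details taken from the paper's Algorithm 1.

```python
# Minimal sketch of an NGU-style episodic intrinsic reward, r_t^episodic.
# All constants and the normalisation scheme are illustrative assumptions.
import numpy as np

def episodic_intrinsic_reward(embedding, memory, k=10, eps=1e-3,
                              cluster_distance=8e-3, c=1e-3, max_similarity=8.0):
    """Return a reward that is large when `embedding` is far from its k nearest
    neighbours in the current episode's memory (within-episode novelty)."""
    if len(memory) == 0:
        return 1.0  # assumption: an empty memory is treated as maximally novel
    sq_dists = np.array([np.sum((embedding - m) ** 2) for m in memory])
    knn = np.sort(sq_dists)[:k]                    # k nearest squared distances
    knn = knn / max(np.mean(knn), 1e-8)            # normalise (NGU uses a running mean)
    knn = np.maximum(knn - cluster_distance, 0.0)  # ignore near-duplicate states
    kernel = eps / (knn + eps)                     # inverse-kernel similarity per neighbour
    similarity = np.sqrt(np.sum(kernel)) + c
    if similarity > max_similarity:
        return 0.0                                 # state is too familiar: no bonus
    return 1.0 / similarity
```

In Agent57-style agents this episodic term is typically modulated by a lifelong novelty signal (e.g., RND) to form the full intrinsic reward; that modulation is omitted from the sketch above.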
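
For the Experiment Setup row, a re-implementation could start by collecting the quoted values into a single config object. This is only a sketch: the class name is hypothetical, only the learning rate, batch size, and frame budget come from the report above, and the remaining settings would have to be filled in from Appendices A and B of the paper.

```python
# Hedged, partial config sketch for a MEME re-implementation (hypothetical name).
from dataclasses import dataclass

@dataclass
class MEMEConfig:
    adam_learning_rate: float = 3e-4     # Table 2: 'Adam Learning Rate 3e-4'
    batch_size: int = 64                 # Table 2: 'Batch Size 64'
    total_env_frames: int = 390_000_000  # frame budget reported in the paper (390M)
    # Trace length, replay settings, exploration parameters, etc. should be
    # taken from Appendix A rather than guessed here.
```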