Bigger, Better, Faster: Human-level Atari with human-level efficiency

Authors: Max Schwarzer, Johan Samir Obando Ceron, Aaron Courville, Marc G Bellemare, Rishabh Agarwal, Pablo Samuel Castro

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available.
Researcher Affiliation | Collaboration | ¹Google DeepMind, ²Mila, ³Université de Montréal. Correspondence to: Max Schwarzer <MaxA.Schwarzer@gmail.com>, Johan Obando Ceron <jobando0730@gmail.com>.
Pseudocode | No | The paper describes the algorithms and components in prose, but it does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We make our code and data publicly available.
Open Datasets | Yes | Mnih et al. (2015a) introduced the agent DQN by combining temporal-difference learning with deep networks, and demonstrated its capabilities in achieving human-level performance on the Arcade Learning Environment (ALE) (Bellemare et al., 2013). Kaiser et al. (2020) introduced the Atari 100K benchmark.
Dataset Splits | Yes | While the Atari 100K training set consists of 26 games, we evaluate the performance of various components in BBF on 29 validation games in the ALE that are not in Atari 100K.
Hardware Specification | Yes | IRIS uses half of an A100 GPU for a week per run. SR-SPR, at its highest replay ratio of 16, uses 25% of an A100 GPU and a single CPU for roughly 24 hours. Our BBF agent at replay ratio 8 takes only 10 hours with a single CPU and half of an A100 GPU.
Software Dependencies | No | The paper mentions software such as the Dopamine framework, Python, NumPy, Matplotlib, and JAX, but does not provide specific version numbers for any of these components.
Experiment Setup | Yes | For BBF, we use RR=8 in order to balance the increased computation arising from our large network. Our n-step schedule... decreases exponentially from 10 to 3 over the first 10K gradient steps following each network reset... reset every 40K gradient steps. We choose γ1 = 0.97, slightly lower than the typical discount used for Atari, and γ2 = 0.997. We incorporate weight decay... use the AdamW optimizer (Loshchilov & Hutter, 2019) with a weight decay value of 0.1.
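
To make the setup in the last row concrete, the sketch below shows one way to implement the quoted schedules and optimizer settings in JAX/optax: an n-step length that decays exponentially from 10 to 3 over the first 10K gradient steps after each reset (resets every 40K gradient steps), a discount rising from 0.97 to 0.997, and AdamW with weight decay 0.1. This is a minimal illustration, not the authors' code; the choice to interpolate geometrically in (1 − γ) space and the learning rate value are assumptions not stated in this section.

```python
# Minimal sketch of BBF-style annealing schedules and optimizer settings
# (assumptions noted in comments; not the authors' implementation).
import jax.numpy as jnp
import optax

ANNEAL_STEPS = 10_000      # schedule horizon after each network reset
RESET_PERIOD = 40_000      # gradient steps between resets
N_START, N_END = 10, 3     # multi-step return length
GAMMA_START, GAMMA_END = 0.97, 0.997


def exp_interp(start, end, frac):
    """Exponential (geometric) interpolation from start to end."""
    frac = jnp.clip(frac, 0.0, 1.0)
    return start * (end / start) ** frac


def schedules(grad_step):
    """n-step and discount as a function of the total gradient step count."""
    steps_since_reset = grad_step % RESET_PERIOD
    frac = steps_since_reset / ANNEAL_STEPS
    n = jnp.round(exp_interp(N_START, N_END, frac)).astype(jnp.int32)
    # Assumption: anneal gamma by interpolating geometrically in (1 - gamma).
    gamma = 1.0 - exp_interp(1.0 - GAMMA_START, 1.0 - GAMMA_END, frac)
    return n, gamma


# AdamW with the weight decay reported in the paper (0.1); the learning
# rate here is a placeholder, not a value quoted in this section.
optimizer = optax.adamw(learning_rate=1e-4, weight_decay=0.1)
```

At replay ratio 8, `schedules` would be queried once per gradient step, so both quantities reach their final values well before the next reset at 40K steps.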