SpeedyZero: Mastering Atari with Limited Data and Time

Authors: Yixuan Mei, Jiaxuan Gao, Weirui Ye, Shaohuai Liu, Yang Gao, Yi Wu

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop SpeedyZero, a distributed RL system built upon a state-of-the-art model-based RL method, EfficientZero, with a dedicated system design for fast distributed computation. We evaluate SpeedyZero on the Atari 100k benchmark (Kaiser et al., 2019); SpeedyZero achieves human-level performance with only 35 minutes of training and 300k samples. Compared with EfficientZero, which requires 8.5 hours of training, SpeedyZero retains a comparable sample efficiency while achieving a 14.5x speedup in wall-clock time (a sanity-check computation of this speedup follows the table).
Researcher Affiliation | Academia | Yixuan Mei (1,2), Jiaxuan Gao (1,2), Weirui Ye (1), Shaohuai Liu (1), Yang Gao (1,2), Yi Wu (1,2); (1) Institute for Interdisciplinary Information Sciences, Tsinghua University; (2) Shanghai Qi Zhi Institute
Pseudocode | No | The paper describes algorithms and presents mathematical formulas (e.g., for Clipped LARS) but does not include any pseudocode or clearly labeled algorithm blocks (a hedged sketch of a clipped LARS-style update follows the table).
Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We evaluate SpeedyZero on the Atari 100k benchmark (Kaiser et al., 2019); SpeedyZero achieves human-level performance with only 35 minutes of training and 300k samples. The benchmark contains 26 Atari games that are deemed solvable with a limited amount of samples.
Dataset Splits | No | The paper uses the Atari 100k benchmark but does not explicitly provide details about training, validation, and test splits (e.g., percentages, sample counts, or specific split files) used in the experiments.
Hardware Specification | Yes | For the 35-minute experiments, the trainer node and the data node are both machines with 8 A100 80GB GPUs (with NVSwitch), 128 CPU cores, and 1TB of RAM. There are 9 reanalysis nodes, each of which contains 4 A100 80GB GPUs (with NVSwitch), 64 CPU cores, and 512GB of RAM. For the 50-minute experiments, the trainer node and the data node both contain 8 A100 80GB GPUs (without NVSwitch), 128 CPU cores, and 512GB of RAM, and the 15 reanalysis nodes each contain 1 NVIDIA RTX 3090 GPU, 128 CPU cores, and 512GB of RAM.
Software Dependencies | No | The paper mentions using Distributed Data Parallel (DDP) provided by PyTorch (Li et al., 2020) but does not specify a version number for PyTorch or any other software dependency.
Experiment Setup | Yes | For the main results in Sec. 5.2 and the ablation study in Sec. 5.3, the trainer node is configured with 8 DDP trainers, and each DDP trainer receives batches of size 256, for a total batch size of 2048. The model held by each reanalyze worker is updated every 25 training steps; the models of the priority refreshers and actors are updated every 10 training steps. The total number of training steps is 15k (a hedged trainer-loop sketch using these numbers follows the table). Additionally, Table 6 lists "Common hyper-parameters of SpeedyZero", including optimizer, max gradient norm, priority exponent, evaluation episodes, and various coefficients; Tables 8 and 9 list hyperparameters for the large-batch-size experiments and PPO, respectively; and Table 11 provides "Key system configuration of SpeedyZero" with detailed numbers of actors, refreshers, buffer capacities, queue capacities, etc.
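
As a quick sanity check of the wall-clock speedup quoted in the Research Type row (the timings are from the paper; the snippet below is just the arithmetic):

```python
# EfficientZero reportedly needs 8.5 hours of training; SpeedyZero needs 35 minutes.
efficientzero_minutes = 8.5 * 60   # 510 minutes
speedyzero_minutes = 35
speedup = efficientzero_minutes / speedyzero_minutes
print(f"{speedup:.1f}x")           # ~14.6x, consistent with the reported 14.5x speedup
```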
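
The Pseudocode row notes that the paper presents formulas for Clipped LARS without an algorithm block. For orientation only, below is a minimal PyTorch sketch of a LARS-style parameter update with a clipped layer-wise trust ratio; the trust-ratio formula, the clipping threshold (`max_ratio`), and the default hyper-parameters are illustrative assumptions and may differ from the paper's exact Clipped LARS.

```python
import torch

def clipped_lars_step(params, lr=1.0, trust_coef=0.001,
                      weight_decay=1e-4, max_ratio=1.0, eps=1e-9):
    """One LARS-style update with a clipped layer-wise trust ratio.

    Illustrative sketch only: the exact Clipped LARS formulation and
    constants in the SpeedyZero paper may differ.
    """
    with torch.no_grad():
        for p in params:
            if p.grad is None:
                continue
            g = p.grad + weight_decay * p        # L2-regularized gradient
            w_norm, g_norm = p.norm(), g.norm()
            # Layer-wise trust ratio; fall back to 1 for zero-norm layers,
            # then clip so no layer receives an overly large effective rate.
            ratio = trust_coef * w_norm / (g_norm + eps)
            if w_norm == 0 or g_norm == 0:
                ratio = torch.ones_like(ratio)
            ratio = torch.clamp(ratio, max=max_ratio)
            p.add_(g, alpha=-lr * ratio.item())
```

A typical (assumed) usage would be to call `clipped_lars_step(model.parameters(), lr=...)` after `loss.backward()` in place of a built-in optimizer step.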
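
The Software Dependencies and Experiment Setup rows together describe the trainer side: PyTorch Distributed Data Parallel with 8 DDP trainers, per-trainer batch size 256 (total 2048), 15k training steps, reanalyze workers refreshed every 25 steps, and actors/priority refreshers refreshed every 10 steps. The sketch below is not the authors' code: the dummy network, random batches, optimizer, and weight-push stubs are assumptions; only the batch sizes, step count, and sync intervals come from the paper.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

PER_TRAINER_BATCH = 256     # batch size per DDP trainer (paper)
TOTAL_TRAIN_STEPS = 15_000  # total training steps (paper)
ACTOR_SYNC_EVERY = 10       # actors / priority refreshers updated every 10 steps (paper)
REANALYZE_SYNC_EVERY = 25   # reanalyze workers updated every 25 steps (paper)

def main():
    # Launch with `torchrun --nproc_per_node=8 trainer.py` to get 8 DDP trainers.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Dummy network standing in for SpeedyZero's actual model.
    model = DDP(torch.nn.Linear(512, 601).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for step in range(TOTAL_TRAIN_STEPS):
        # Random batch standing in for samples from the data node's replay buffer.
        obs = torch.randn(PER_TRAINER_BATCH, 512, device=local_rank)
        loss = model(obs).pow(2).mean()          # placeholder loss
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % ACTOR_SYNC_EVERY == 0:
            pass  # push latest weights to actors / priority refreshers
        if step % REANALYZE_SYNC_EVERY == 0:
            pass  # push latest weights to reanalyze workers

if __name__ == "__main__":
    main()
```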