Accelerating Reinforcement Learning through GPU Atari Emulation

Authors: Steven Dalton, Iuri Frosio

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | CuLE generates up to 155M frames per hour on a single GPU, a finding previously achieved only through a cluster of CPUs. Beyond highlighting the differences between CPU and GPU emulators in the context of reinforcement learning, we show how to leverage the high throughput of CuLE by effective batching of the training data, and show accelerated convergence for A2C+V-trace.
Researcher Affiliation | Industry | Steven Dalton, Iuri Frosio, NVIDIA, USA, {sdalton,ifrosio}@nvidia.com
Pseudocode | No | The paper describes the architecture and functionality of CuLE in text, but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | CuLE is available at https://github.com/NVlabs/cule.
Open Datasets | Yes | We focus our attention on the inference path and move from the traditional CPU implementation of the Atari Learning Environment (ALE), a set of Atari 2600 games that emerged as an excellent DRL benchmark [3, 11]. We show that significant performance bottlenecks stem from CPU-based environment emulation... OpenAI Gym [15] (a minimal environment-setup sketch follows the table).
Dataset Splits | No | The paper mentions using Atari games and OpenAI Gym environments but does not specify explicit training, validation, or test dataset splits with percentages or sample counts.
Hardware Specification | Yes | Table 2: Systems used for experiments. System I: 12-core Core i7-5930K @3.50GHz, Titan V; System II: 6-core Core i7-8086K @5GHz, Tesla V100; System III: 20-core Core E5-2698 v4 @2.20GHz ×2, Tesla V100 ×8, NVLink.
Software Dependencies | No | The paper mentions 'Our PyTorch [16] implementation of A2C' but does not specify the version of PyTorch or any other software dependencies with version numbers.
Experiment Setup | Yes | in our experiments we use a vanilla A2C [15], with N-step bootstrapping, and N = 5 as the baseline; This configuration takes, on average, 21.2 minutes (and 5.5M training frames) to reach a score of 18 for Pong and 16.6 minutes (4.0M training frames) for a score of 1,500 on Ms-Pacman; Furthermore, as only the most recent data in a batch are generated with the current policy, we use V-trace [6] for off-policy correction. Extending the batch size in the temporal dimension (N-steps bootstrapping, N = 20). (A hedged V-trace sketch also follows the table.)
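
Environment-setup sketch (not from the paper or the report above): a minimal example of how the OpenAI Gym [15] Atari benchmark cited under Open Datasets is conventionally instantiated. The environment id "PongNoFrameskip-v4" and the classic (pre-0.26) Gym step signature are assumptions for illustration.

```python
# Minimal sketch (not taken from the paper): driving an Atari environment
# through OpenAI Gym. Assumes the Atari extras (atari-py / ale-py) are installed
# and the classic Gym API where step() returns a 4-tuple.
import gym

env = gym.make("PongNoFrameskip-v4")
obs = env.reset()
for _ in range(100):
    action = env.action_space.sample()          # random policy, just to step the emulator
    obs, reward, done, info = env.step(action)  # 4-tuple in classic Gym releases
    if done:
        obs = env.reset()
env.close()
```

CuLE replaces this CPU-side emulation loop with batched GPU emulation of thousands of such environments; its own API differs and is documented in the repository linked under Open Source Code.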
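
Experiment-setup sketch (not the authors' code): the quoted setup combines N-step bootstrapping (N = 5 baseline, extended to N = 20 in the temporal dimension) with V-trace [6] off-policy correction. Below is a minimal PyTorch sketch of the V-trace value targets defined in Espeholt et al. [6]; the function name, the [T, B] rollout layout, and the clipping defaults are assumptions for illustration.

```python
# Minimal V-trace target computation (sketch, not the paper's implementation).
# rewards, values, log_rhos, discounts: tensors of shape [T, B], where T is the
# rollout length (e.g. N = 5 or N = 20 steps) and B is the number of environments.
# log_rhos = log pi(a_t|x_t) - log mu(a_t|x_t) (current vs. behaviour policy);
# discounts = gamma * (1 - done) folds episode terminations into the discount.
import torch

def vtrace_targets(rewards, values, bootstrap_value, log_rhos, discounts,
                   clip_rho=1.0, clip_c=1.0):
    rhos = torch.exp(log_rhos).clamp(max=clip_rho)   # truncated importance weights rho_t
    cs = torch.exp(log_rhos).clamp(max=clip_c)       # "trace-cutting" coefficients c_t
    values_tp1 = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = rhos * (rewards + discounts * values_tp1 - values)  # rho_t * TD errors

    # Backward recursion: v_s - V(x_s) = delta_s + discount_s * c_s * (v_{s+1} - V(x_{s+1}))
    acc = torch.zeros_like(bootstrap_value)
    diffs = []
    for t in reversed(range(rewards.shape[0])):
        acc = deltas[t] + discounts[t] * cs[t] * acc
        diffs.append(acc)
    diffs = torch.stack(list(reversed(diffs)), dim=0)
    return diffs + values                            # V-trace targets v_s, shape [T, B]
```

With fully on-policy data the importance ratios are all 1 and these targets reduce to standard N-step bootstrapped returns; the correction matters when the batch is extended to N = 20 and the older transitions were generated by a slightly stale policy, which is the regime the quoted setup describes.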