Leveraging Procedural Generation to Benchmark Reinforcement Learning

Authors: Karl Cobbe, Chris Hesse, Jacob Hilton, John Schulman

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We empirically demonstrate that diverse environment distributions are essential to adequately train and evaluate RL agents, thereby motivating the extensive use of procedural content generation. We then use this benchmark to investigate the effects of scaling model size, finding that larger models significantly improve both sample efficiency and generalization.
Researcher Affiliation | Industry | OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe <karl@openai.com>.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | All environments are open-source and can be found at https://github.com/openai/procgen.
Open Datasets | Yes | We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments... All environments are open-source and can be found at https://github.com/openai/procgen.
Dataset Splits | No | The paper describes training and testing on levels: 'When evaluating generalization, we train on a finite set of levels and we test on the full distribution of levels. Unless otherwise specified, we use a training set of 500 levels to evaluate generalization in each environment.' However, it does not mention a distinct validation set or split. (An illustrative sketch of this train/test level split appears below the table.)
Hardware Specification | No | The paper states: 'training for 200M timesteps with PPO on a single Procgen environment requires approximately 24 GPU-hrs and 60 CPU-hrs.' GPUs and CPUs are mentioned only generically; no specific models, brands, or detailed hardware specifications are provided.
Software Dependencies | No | The paper mentions using 'Proximal Policy Optimization (Schulman et al., 2017)' and 'Rainbow (Hessel et al., 2018)' as algorithms, and 'IMPALA (Espeholt et al., 2018)' for the convolutional architecture. However, it does not specify version numbers for any software components, libraries, or programming languages used. (An illustrative, non-official sketch of the IMPALA-style encoder follows the table.)
Experiment Setup | Yes | By default, we train agents using Proximal Policy Optimization (Schulman et al., 2017) for 200M timesteps... We recommend training easy difficulty environments for 25M timesteps... When we scale the number of IMPALA channels by k, we also scale the learning rate by 1/√k... We performed sweeps over other hyperparameters, including the batch size and the number of epochs per rollout... See Appendix D for a full list of Rainbow hyperparameters. (The learning-rate scaling rule is illustrated at the end of this page.)
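
The generalization protocol quoted above (train on a fixed set of 500 levels, evaluate on the full level distribution) maps directly onto the environment constructor in the open-source procgen package. The following is a minimal sketch, not code from the paper: it assumes the Gym-registration interface and the num_levels / start_level / distribution_mode keyword arguments documented in the openai/procgen README, and the pre-0.26 Gym reset/step signatures.

# Illustrative sketch (not from the paper): building the train/test level split
# using the openai/procgen Gym registration. Keyword names follow the procgen
# README; adjust for the versions of gym and procgen you have installed.
import gym

# Training environment: a fixed, finite set of 500 procedurally generated levels.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=500,            # finite training set of levels
    start_level=0,             # seed offset for the level generator
    distribution_mode="easy",  # paper recommends 25M timesteps on easy difficulty
)

# Test environment: num_levels=0 requests the full (unrestricted) level distribution.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,
    start_level=0,
    distribution_mode="easy",
)

obs = train_env.reset()
obs, reward, done, info = train_env.step(train_env.action_space.sample())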
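
The paper's policy uses the convolutional encoder from IMPALA (Espeholt et al., 2018), with channel counts scaled by a width multiplier k in the model-size experiments. The module below is a hedged reconstruction of that architecture for illustration only; the choice of PyTorch, the class names, and the 256-unit hidden layer are assumptions rather than details confirmed by this report.

# Hedged sketch of an IMPALA-style residual CNN in PyTorch (illustrative only;
# the paper does not publish its implementation details or framework versions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv0(F.relu(x))
        out = self.conv1(F.relu(out))
        return out + x


class ConvSequence(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.conv(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
        return self.res1(self.res0(x))


class ImpalaCNN(nn.Module):
    """Encoder for 64x64 RGB Procgen frames; depths scale with the width factor k."""

    def __init__(self, depths=(16, 32, 32), hidden_size=256):
        super().__init__()
        layers, in_ch = [], 3
        for d in depths:
            layers.append(ConvSequence(in_ch, d))
            in_ch = d
        self.convs = nn.Sequential(*layers)
        # Three stride-2 pools reduce 64x64 inputs to 8x8 feature maps.
        self.fc = nn.Linear(in_ch * 8 * 8, hidden_size)

    def forward(self, x):  # x: (batch, 3, 64, 64), float in [0, 1]
        x = self.convs(x)
        x = torch.flatten(F.relu(x), start_dim=1)
        return F.relu(self.fc(x))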
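
The quoted scaling rule ties the optimizer to the width multiplier: scaling the IMPALA channel counts by k scales the learning rate by 1/√k. A minimal worked example follows; the base value of 5e-4 is an assumed PPO learning rate used here for illustration, not a figure verified by this report.

# Illustration of the learning-rate rule quoted above: when the number of IMPALA
# channels is scaled by k, the learning rate is scaled by 1/sqrt(k).
import math

base_lr = 5e-4  # assumed base PPO learning rate, for illustration only
for k in (1, 2, 4):
    scaled_lr = base_lr / math.sqrt(k)
    print(f"width multiplier k={k}: learning rate = {scaled_lr:.2e}")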