Leveraging Procedural Generation to Benchmark Reinforcement Learning

Authors: Karl Cobbe, Chris Hesse, Jacob Hilton, John Schulman

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning. We empirically demonstrate that diverse environment distributions are essential to adequately train and evaluate RL agents, thereby motivating the extensive use of procedural content generation. We then use this benchmark to investigate the effects of scaling model size, finding that larger models significantly improve both sample efficiency and generalization.
Researcher Affiliation | Industry | OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe <karl@openai.com>.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | All environments are open-source and can be found at https://github.com/openai/procgen.
Open Datasets | Yes | We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments... All environments are open-source and can be found at https://github.com/openai/procgen.
Dataset Splits | No | The paper describes training and testing on levels: 'When evaluating generalization, we train on a finite set of levels and we test on the full distribution of levels. Unless otherwise specified, we use a training set of 500 levels to evaluate generalization in each environment.' However, it does not mention a distinct validation set or split. (An illustrative sketch of this train/test level split appears below the table.)
Hardware Specification | No | The paper states: 'training for 200M timesteps with PPO on a single Procgen environment requires approximately 24 GPU-hrs and 60 CPU-hrs.' GPUs and CPUs are mentioned only generically; no specific models, brands, or detailed hardware specifications are provided.
Software Dependencies | No | The paper mentions using 'Proximal Policy Optimization (Schulman et al., 2017)' and 'Rainbow (Hessel et al., 2018)' as algorithms, and 'IMPALA (Espeholt et al., 2018)' for the convolutional architecture. However, it does not specify version numbers for any software components, libraries, or programming languages used. (An illustrative, non-official sketch of the IMPALA-style encoder follows the table.)
Experiment Setup | Yes | By default, we train agents using Proximal Policy Optimization (Schulman et al., 2017) for 200M timesteps... We recommend training easy difficulty environments for 25M timesteps... When we scale the number of IMPALA channels by k, we also scale the learning rate by 1/√k... We performed sweeps over other hyperparameters, including the batch size and the number of epochs per rollout... See Appendix D for a full list of Rainbow hyperparameters. (The learning-rate scaling rule is illustrated at the end of this page.)
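
The generalization protocol quoted above (train on a fixed set of 500 levels, evaluate on the full level distribution) maps directly onto the environment constructor in the open-source procgen package. The following is a minimal sketch, not code from the paper: it assumes the Gym-registration interface and the num_levels / start_level / distribution_mode keyword arguments documented in the openai/procgen README, and the pre-0.26 Gym reset/step signatures.

# Illustrative sketch (not from the paper): building the train/test level split
# using the openai/procgen Gym registration. Keyword names follow the procgen
# README; adjust for the versions of gym and procgen you have installed.
import gym

# Training environment: a fixed, finite set of 500 procedurally generated levels.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=500,            # finite training set of levels
    start_level=0,             # seed offset for the level generator
    distribution_mode="easy",  # paper recommends 25M timesteps on easy difficulty
)

# Test environment: num_levels=0 requests the full (unrestricted) level distribution.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,
    start_level=0,
    distribution_mode="easy",
)

obs = train_env.reset()
obs, reward, done, info = train_env.step(train_env.action_space.sample())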
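
The paper's policy uses the convolutional encoder from IMPALA (Espeholt et al., 2018), with channel counts scaled by a width multiplier k in the model-size experiments. The module below is a hedged reconstruction of that architecture for illustration only; the choice of PyTorch, the class names, and the 256-unit hidden layer are assumptions rather than details confirmed by this report.

# Hedged sketch of an IMPALA-style residual CNN in PyTorch (illustrative only;
# the paper does not publish its implementation details or framework versions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv0 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.conv0(F.relu(x))
        out = self.conv1(F.relu(out))
        return out + x


class ConvSequence(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.res0 = ResidualBlock(out_channels)
        self.res1 = ResidualBlock(out_channels)

    def forward(self, x):
        x = self.conv(x)
        x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
        return self.res1(self.res0(x))


class ImpalaCNN(nn.Module):
    """Encoder for 64x64 RGB Procgen frames; depths scale with the width factor k."""

    def __init__(self, depths=(16, 32, 32), hidden_size=256):
        super().__init__()
        layers, in_ch = [], 3
        for d in depths:
            layers.append(ConvSequence(in_ch, d))
            in_ch = d
        self.convs = nn.Sequential(*layers)
        # Three stride-2 pools reduce 64x64 inputs to 8x8 feature maps.
        self.fc = nn.Linear(in_ch * 8 * 8, hidden_size)

    def forward(self, x):  # x: (batch, 3, 64, 64), float in [0, 1]
        x = self.convs(x)
        x = torch.flatten(F.relu(x), start_dim=1)
        return F.relu(self.fc(x))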
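
The quoted scaling rule ties the optimizer to the width multiplier: scaling the IMPALA channel counts by k scales the learning rate by 1/√k. A minimal worked example follows; the base value of 5e-4 is an assumed PPO learning rate used here for illustration, not a figure verified by this report.

# Illustration of the learning-rate rule quoted above: when the number of IMPALA
# channels is scaled by k, the learning rate is scaled by 1/sqrt(k).
import math

base_lr = 5e-4  # assumed base PPO learning rate, for illustration only
for k in (1, 2, 4):
    scaled_lr = base_lr / math.sqrt(k)
    print(f"width multiplier k={k}: learning rate = {scaled_lr:.2e}")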