Quantifying Generalization in Reinforcement Learning

Authors: Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We train 9 agents to play CoinRun, each on a training set with a different number of levels. ... We first train agents with policies using the same 3-layer convolutional architecture proposed by (Mnih et al., 2015), which we henceforth call Nature-CNN. ... Results are shown in Figure 2a." (A sketch of the Nature-CNN architecture appears after this table.)
Researcher Affiliation | Industry | "OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe <karl@openai.com>."
Pseudocode | No | The paper describes methods like PPO and architectural details, but does not present any pseudocode or algorithm blocks.
Open Source Code | Yes | "Videos of a trained agent playing can be found here, and environment code can be found here."
Open Datasets | No | The paper uses a procedurally generated environment (CoinRun) of the authors' own design and states 'Each level is generated deterministically from a given seed, providing agents access to an arbitrarily large and easily quantifiable supply of training data.' However, it does not provide concrete access (link, citation, repository) to a pre-existing publicly available dataset of these levels. (See the seed-based level-split sketch after this table.)
Dataset Splits | No | The paper discusses training and test sets but does not explicitly mention or detail a separate validation set or split for hyperparameter tuning.
Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments (e.g., specific GPU or CPU models).
Software Dependencies | No | The paper mentions software like 'Proximal Policy Optimization' and 'OpenAI Baselines' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "We use γ = .999, as an optimal agent takes between 50 and 500 timesteps to solve a level, depending on level difficulty. See Appendix D for a full list of hyperparameters. ... We train agents with Proximal Policy Optimization (Schulman et al., 2017; Dhariwal et al., 2017) for a total of 256M timesteps across 8 workers. ... We first train agents with either dropout probability p ∈ [0, 0.25] or with L2 penalty w ∈ [0, 2.5 × 10⁻⁴]." (The reported values are collected in the configuration sketch after this table.)
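The Nature-CNN referenced in the Research Type row is the 3-layer convolutional network from Mnih et al. (2015). Below is a minimal PyTorch sketch of that architecture; the 64x64 RGB input shape and the action count are assumptions for a CoinRun-like setup, not values quoted in this report.

```python
import torch
import torch.nn as nn

class NatureCNN(nn.Module):
    """3-layer convolutional backbone from Mnih et al. (2015).

    Filter sizes and strides follow the original paper; the 64x64x3 input
    and 15-action output are assumptions for a CoinRun-like environment.
    """
    def __init__(self, num_actions: int = 15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size for a 64x64 input, then add the dense head.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 3, 64, 64)).shape[1]
        self.fc = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.policy_logits = nn.Linear(512, num_actions)
        self.value = nn.Linear(512, 1)

    def forward(self, obs: torch.Tensor):
        h = self.fc(self.features(obs / 255.0))
        return self.policy_logits(h), self.value(h)
```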
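The Open Datasets and Dataset Splits rows hinge on the fact that CoinRun levels are generated deterministically from integer seeds rather than loaded from a fixed dataset. The sketch below shows one way such a seed-based train/test split could be constructed; `make_coinrun_level` is a hypothetical stand-in for the environment's level generator, and the set sizes are illustrative.

```python
import random

def make_level_split(num_train_levels: int, num_test_levels: int, rng_seed: int = 0):
    """Partition integer level seeds into disjoint train/test sets.

    Because each level is generated deterministically from its seed,
    fixing a finite set of training seeds bounds the training data,
    while held-out seeds yield unseen test levels.
    """
    rng = random.Random(rng_seed)
    # Sample disjoint seeds from a large seed space (size is illustrative).
    all_seeds = rng.sample(range(2**31), num_train_levels + num_test_levels)
    return all_seeds[:num_train_levels], all_seeds[num_train_levels:]

# Usage: the paper trains agents on training sets of different sizes;
# here we simply draw 500 training seeds and 1000 held-out test seeds.
train_seeds, test_seeds = make_level_split(500, 1000)
# level = make_coinrun_level(seed)   # hypothetical generator call
```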
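The Experiment Setup row quotes γ = .999, 256M total timesteps across 8 workers, and regularization sweeps over dropout p ∈ [0, 0.25] and L2 penalty w ∈ [0, 2.5 × 10⁻⁴]. The sketch below collects those reported values into a configuration dict with an illustrative sweep; `train_ppo` and the specific grid points are assumptions, and any hyperparameter not quoted above is not taken from the paper.

```python
# Reported values from the quoted Experiment Setup; everything else is assumed.
base_config = {
    "gamma": 0.999,                   # quoted discount factor
    "total_timesteps": 256_000_000,   # 256M timesteps
    "num_workers": 8,                 # quoted worker count
    "algo": "ppo",                    # Proximal Policy Optimization
}

# Illustrative sweep grids within the quoted ranges.
dropout_probs = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25]           # p in [0, 0.25]
l2_penalties = [0.0, 0.5e-4, 1.0e-4, 1.5e-4, 2.0e-4, 2.5e-4]  # w in [0, 2.5e-4]

def run_regularization_sweep(train_ppo):
    """Launch one run per regularizer setting (train_ppo is hypothetical)."""
    results = {}
    for p in dropout_probs:
        results[("dropout", p)] = train_ppo(**base_config, dropout=p, l2_weight=0.0)
    for w in l2_penalties:
        results[("l2", w)] = train_ppo(**base_config, dropout=0.0, l2_weight=w)
    return results
```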