Quantifying Generalization in Reinforcement Learning

Authors: Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We train 9 agents to play CoinRun, each on a training set with a different number of levels. ... We first train agents with policies using the same 3-layer convolutional architecture proposed by (Mnih et al., 2015), which we henceforth call Nature-CNN. ... Results are shown in Figure 2a." (A sketch of the Nature-CNN architecture appears after this table.)
Researcher Affiliation | Industry | "OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe <karl@openai.com>."
Pseudocode | No | The paper describes methods like PPO and architectural details, but does not present any pseudocode or algorithm blocks.
Open Source Code | Yes | "Videos of a trained agent playing can be found here, and environment code can be found here."
Open Datasets | No | The paper uses a procedurally generated environment (CoinRun) of the authors' own design and states 'Each level is generated deterministically from a given seed, providing agents access to an arbitrarily large and easily quantifiable supply of training data.' However, it does not provide concrete access (link, citation, repository) to a pre-existing publicly available dataset of these levels. (See the seed-based level-split sketch after this table.)
Dataset Splits | No | The paper discusses training and test sets but does not explicitly mention or detail a separate validation set or split for hyperparameter tuning.
Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments (e.g., specific GPU or CPU models).
Software Dependencies | No | The paper mentions software like 'Proximal Policy Optimization' and 'OpenAI Baselines' but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | "We use γ = .999, as an optimal agent takes between 50 and 500 timesteps to solve a level, depending on level difficulty. See Appendix D for a full list of hyperparameters. ... We train agents with Proximal Policy Optimization (Schulman et al., 2017; Dhariwal et al., 2017) for a total of 256M timesteps across 8 workers. ... We first train agents with either dropout probability p ∈ [0, 0.25] or with L2 penalty w ∈ [0, 2.5 × 10⁻⁴]." (The reported values are collected in the configuration sketch after this table.)
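The Nature-CNN referenced in the Research Type row is the 3-layer convolutional network from Mnih et al. (2015). Below is a minimal PyTorch sketch of that architecture; the 64x64 RGB input shape and the action count are assumptions for a CoinRun-like setup, not values quoted in this report.

```python
import torch
import torch.nn as nn

class NatureCNN(nn.Module):
    """3-layer convolutional backbone from Mnih et al. (2015).

    Filter sizes and strides follow the original paper; the 64x64x3 input
    and 15-action output are assumptions for a CoinRun-like environment.
    """
    def __init__(self, num_actions: int = 15):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size for a 64x64 input, then add the dense head.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, 3, 64, 64)).shape[1]
        self.fc = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.policy_logits = nn.Linear(512, num_actions)
        self.value = nn.Linear(512, 1)

    def forward(self, obs: torch.Tensor):
        h = self.fc(self.features(obs / 255.0))
        return self.policy_logits(h), self.value(h)
```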
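The Open Datasets and Dataset Splits rows hinge on the fact that CoinRun levels are generated deterministically from integer seeds rather than loaded from a fixed dataset. The sketch below shows one way such a seed-based train/test split could be constructed; `make_coinrun_level` is a hypothetical stand-in for the environment's level generator, and the set sizes are illustrative.

```python
import random

def make_level_split(num_train_levels: int, num_test_levels: int, rng_seed: int = 0):
    """Partition integer level seeds into disjoint train/test sets.

    Because each level is generated deterministically from its seed,
    fixing a finite set of training seeds bounds the training data,
    while held-out seeds yield unseen test levels.
    """
    rng = random.Random(rng_seed)
    # Sample disjoint seeds from a large seed space (size is illustrative).
    all_seeds = rng.sample(range(2**31), num_train_levels + num_test_levels)
    return all_seeds[:num_train_levels], all_seeds[num_train_levels:]

# Usage: the paper trains agents on training sets of different sizes;
# here we simply draw 500 training seeds and 1000 held-out test seeds.
train_seeds, test_seeds = make_level_split(500, 1000)
# level = make_coinrun_level(seed)   # hypothetical generator call
```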
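The Experiment Setup row quotes γ = .999, 256M total timesteps across 8 workers, and regularization sweeps over dropout p ∈ [0, 0.25] and L2 penalty w ∈ [0, 2.5 × 10⁻⁴]. The sketch below collects those reported values into a configuration dict with an illustrative sweep; `train_ppo` and the specific grid points are assumptions, and any hyperparameter not quoted above is not taken from the paper.

```python
# Reported values from the quoted Experiment Setup; everything else is assumed.
base_config = {
    "gamma": 0.999,                   # quoted discount factor
    "total_timesteps": 256_000_000,   # 256M timesteps
    "num_workers": 8,                 # quoted worker count
    "algo": "ppo",                    # Proximal Policy Optimization
}

# Illustrative sweep grids within the quoted ranges.
dropout_probs = [0.0, 0.05, 0.10, 0.15, 0.20, 0.25]           # p in [0, 0.25]
l2_penalties = [0.0, 0.5e-4, 1.0e-4, 1.5e-4, 2.0e-4, 2.5e-4]  # w in [0, 2.5e-4]

def run_regularization_sweep(train_ppo):
    """Launch one run per regularizer setting (train_ppo is hypothetical)."""
    results = {}
    for p in dropout_probs:
        results[("dropout", p)] = train_ppo(**base_config, dropout=p, l2_weight=0.0)
    for w in l2_penalties:
        results[("l2", w)] = train_ppo(**base_config, dropout=0.0, l2_weight=w)
    return results
```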