Quantifying Generalization in Reinforcement Learning
Authors: Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, John Schulman
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train 9 agents to play CoinRun, each on a training set with a different number of levels. ... We first train agents with policies using the same 3-layer convolutional architecture proposed in (Mnih et al., 2015), which we henceforth call Nature-CNN. ... Results are shown in Figure 2a. |
| Researcher Affiliation | Industry | OpenAI, San Francisco, CA, USA. Correspondence to: Karl Cobbe <karl@openai.com>. |
| Pseudocode | No | The paper describes methods like PPO and architectural details, but does not present any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Videos of a trained agent playing can be found here, and environment code can be found here. |
| Open Datasets | No | The paper uses a procedurally generated environment (CoinRun) of the authors' own design and states 'Each level is generated deterministically from a given seed, providing agents access to an arbitrarily large and easily quantifiable supply of training data.' However, it does not provide concrete access (link, citation, repository) to a pre-existing publicly available dataset of these levels. |
| Dataset Splits | No | The paper discusses training and test sets but does not explicitly mention or detail a separate validation set or split for hyperparameter tuning. |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for running experiments (e.g., specific GPU or CPU models). |
| Software Dependencies | No | The paper mentions software like 'Proximal Policy Optimization' and 'OpenAI Baselines' but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We use γ = .999, as an optimal agent takes between 50 and 500 timesteps to solve a level, depending on level difficulty. See Appendix D for a full list of hyperparameters. ... We train agents with Proximal Policy Optimization (Schulman et al., 2017; Dhariwal et al., 2017) for a total of 256M timesteps across 8 workers. ... We first train agents with either dropout probability p ∈ [0, 0.25] or with L2 penalty w ∈ [0, 2.5 × 10⁻⁴]. |
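
To make the "Nature-CNN" reference in the Research Type row concrete, the sketch below reproduces the 3-layer convolutional trunk of Mnih et al. (2015) in PyTorch. This is a minimal illustration, not the authors' code: the 64×64 input resolution and 512-unit hidden layer follow the original architecture, while the action count and the separate policy/value heads are assumptions.

```python
# Minimal sketch (not the authors' code) of the "Nature-CNN" policy trunk,
# following the 3-layer convolutional architecture of Mnih et al. (2015).
import torch
import torch.nn as nn


class NatureCNN(nn.Module):
    def __init__(self, in_channels: int = 3, num_actions: int = 7):  # action count is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened size for a 64x64 observation rather than hard-coding it.
        with torch.no_grad():
            n_flat = self.features(torch.zeros(1, in_channels, 64, 64)).shape[1]
        self.fc = nn.Sequential(nn.Linear(n_flat, 512), nn.ReLU())
        self.policy_head = nn.Linear(512, num_actions)  # action logits
        self.value_head = nn.Linear(512, 1)             # state-value estimate

    def forward(self, obs: torch.Tensor):
        h = self.fc(self.features(obs))
        return self.policy_head(h), self.value_head(h)
```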
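
Similarly, the Experiment Setup values quoted in the table can be collected into a hypothetical configuration object. Only γ, the timestep budget, the worker count, and the regularization sweep ranges come from the paper; the remaining fields are illustrative placeholders, not the values listed in its Appendix D.

```python
# Hypothetical PPO configuration mirroring the quoted setup: gamma = 0.999,
# 256M total timesteps across 8 workers, and regularization sweeps with
# dropout p in [0, 0.25] and L2 penalty w in [0, 2.5e-4].
from dataclasses import dataclass


@dataclass
class PPOConfig:
    gamma: float = 0.999              # quoted discount factor
    total_timesteps: int = 256_000_000
    num_workers: int = 8
    dropout_p: float = 0.0            # swept over [0, 0.25] in the paper
    l2_penalty: float = 0.0           # swept over [0, 2.5e-4] in the paper
    # Illustrative defaults below (assumptions, not taken from the paper):
    learning_rate: float = 5e-4
    clip_range: float = 0.2
    gae_lambda: float = 0.95


config = PPOConfig()
```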