Learning Dynamics and Generalization in Deep Reinforcement Learning
Authors: Clare Lyle, Mark Rowland, Will Dabney, Marta Kwiatkowska, Yarin Gal
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods. |
| Researcher Affiliation | Collaboration | 1 Department of Computer Science, University of Oxford; 2 DeepMind. |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement indicating the release of open-source code for the described methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | We train a standard deep Q-network (DQN) architecture on environments from the Atari 2600 suite... We run our evaluations in the Procgen environment (Cobbe et al., 2019), which consists of 16 games with procedurally generated levels. |
| Dataset Splits | No | The paper mentions training on 'a limited subset of the levels' (10 in Procgen) and evaluating 'on the full distribution' or 'a disjoint subset' for testing, but does not specify a separate validation set or split. |
| Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU/CPU models, memory details) used for running its experiments. It mentions 'Due to computational constraints' but provides no further specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions that 'Our PPO and DAAC agents use the same hyperparameters and implementation as is provided by Raileanu and Fergus (2021)' but does not list software versions within this document. |
| Experiment Setup | Yes | We train the original agent for 50M frames using ϵ-greedy exploration with ϵ = 0.1, and train the distillation agents for a number of updates equivalent to 10M frames of data collected online... We set λ = 1e-2 in our evaluations... We use a replay capacity of 1e6... |
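
The train/test level split described in the Open Datasets and Dataset Splits rows can be sketched with the public `procgen` Gym registration. This is a minimal sketch, not the authors' code: the game choice (`coinrun`) and `distribution_mode` are assumptions, while the 10-level training subset and full-distribution evaluation follow the quotes above.

```python
# Minimal sketch of a Procgen train/evaluation level split, assuming the
# "coinrun" game and "easy" distribution_mode (neither is specified in the
# quotes above).
import gym

# Training environment: restricted to a limited subset of 10 levels.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=10,             # limited training subset, as quoted above
    start_level=0,
    distribution_mode="easy",  # assumption, not stated in the paper quote
)

# Evaluation environment: num_levels=0 samples from the full level distribution.
eval_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=0,
    start_level=0,
    distribution_mode="easy",
)
```

Likewise, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. The field names and the ϵ-greedy helper below are illustrative and hypothetical, not the authors' implementation; only the values (50M frames, ϵ = 0.1, 10M distillation-equivalent frames, λ = 1e-2, replay capacity 1e6) come from the quote.

```python
import random

# Hedged sketch: hyperparameters gathered from the Experiment Setup quote.
# Field names are illustrative and not taken from the authors' code.
dqn_config = dict(
    total_frames=50_000_000,     # original agent trained for 50M frames
    epsilon=0.1,                 # epsilon-greedy exploration parameter
    distill_frames=10_000_000,   # distillation agents: updates equivalent to 10M frames
    regularization_lambda=1e-2,  # lambda value used in the paper's evaluations
    replay_capacity=1_000_000,   # replay buffer capacity of 1e6
)

def epsilon_greedy(q_values, epsilon=dqn_config["epsilon"]):
    """Select an action epsilon-greedily from a sequence of Q-values."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```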
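For example, `epsilon_greedy([0.1, 0.7, 0.2])` returns action 1 with probability 0.9 and a uniformly random action otherwise, matching the ϵ = 0.1 exploration scheme quoted above.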