Learning Dynamics and Generalization in Deep Reinforcement Learning

Authors: Clare Lyle, Mark Rowland, Will Dabney, Marta Kwiatkowska, Yarin Gal

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We corroborate these findings in deep RL agents trained on a range of environments, finding that neural networks trained using temporal difference algorithms on dense reward tasks exhibit weaker generalization between states than randomly initialized networks and networks trained with policy gradient methods.
Researcher Affiliation | Collaboration | (1) Department of Computer Science, University of Oxford; (2) DeepMind.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement indicating the release of open-source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | We train a standard deep Q-network (DQN) architecture on environments from the Atari 2600 suite... We run our evaluations in the Procgen environment (Cobbe et al., 2019), which consists of 16 games with procedurally generated levels.
Dataset Splits | No | The paper mentions training on 'a limited subset of the levels' (10 in Procgen) and evaluating on 'the full distribution' or 'a disjoint subset' for testing, but does not specify a separate validation set or split (see the level-split sketch after this table).
Hardware Specification | No | The paper does not explicitly state the specific hardware (e.g., GPU/CPU models, memory details) used for running its experiments. It mentions 'Due to computational constraints' but provides no further specifications.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions that 'Our PPO and DAAC agents use the same hyperparameters and implementation as is provided by Raileanu and Fergus (2021)' but does not list software versions within this document.
Experiment Setup | Yes | We train the original agent for 50M frames using ϵ-greedy exploration with ϵ = 0.1, and train the distillation agents for a number of updates equivalent to 10M frames of data collected online... We set λ = 1e-2 in our evaluations... We use a replay capacity of 1e6... (see the configuration sketch after this table).
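
The hyperparameters quoted in the Experiment Setup row can be collected in one place. The sketch below is a minimal illustration, not the authors' implementation (the table notes that no code was released): the names ExperimentConfig, epsilon_greedy, and ReplayBuffer are hypothetical, and only the numeric values (50M online frames, 10M distillation-equivalent frames, ϵ = 0.1, λ = 1e-2, replay capacity 1e6) come from the quoted text.

```python
# Illustrative sketch of the reported experiment settings. The paper releases no
# code, so every name below is hypothetical; only the numeric values (50M frames,
# epsilon = 0.1, lambda = 1e-2, replay capacity 1e6) come from the quoted setup.
import random
from collections import deque
from dataclasses import dataclass

import numpy as np


@dataclass
class ExperimentConfig:                  # hypothetical container, not from the paper
    online_frames: int = 50_000_000      # "train the original agent for 50M frames"
    distill_frames: int = 10_000_000     # "updates equivalent to 10M frames"
    epsilon: float = 0.1                 # epsilon-greedy exploration parameter
    lam: float = 1e-2                    # "We set lambda = 1e-2 in our evaluations"
    replay_capacity: int = 1_000_000     # "replay capacity of 1e6"


def epsilon_greedy(q_values: np.ndarray, epsilon: float) -> int:
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))


class ReplayBuffer:
    """Fixed-capacity FIFO transition buffer, as used by a standard DQN agent."""

    def __init__(self, capacity: int):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int):
        return random.sample(self.buffer, batch_size)


if __name__ == "__main__":
    cfg = ExperimentConfig()
    replay = ReplayBuffer(cfg.replay_capacity)
    q = np.array([0.2, 1.3, -0.5, 0.9])           # dummy Q-values for 4 actions
    print("chosen action:", epsilon_greedy(q, cfg.epsilon))
```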
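
The Open Datasets and Dataset Splits rows describe training on a small fixed set of Procgen levels and evaluating on levels the agent never saw. Below is a minimal sketch of such a disjoint level split, assuming the publicly documented procgen gym interface (num_levels, start_level, distribution_mode); the game and level ranges are illustrative and not taken from the paper.

```python
# Illustrative Procgen train/test level split, assuming the documented procgen
# gym interface (pip install procgen gym). The game, level counts, and ranges
# below are NOT taken from the paper; they only show how disjoint splits work.
import gym

# Training environment: a fixed, limited set of 10 procedurally generated levels
# (level seeds 0-9), mirroring the "limited subset of the levels" quoted above.
train_env = gym.make(
    "procgen:procgen-coinrun-v0",   # example game; the paper evaluates several
    num_levels=10,
    start_level=0,
    distribution_mode="easy",
)

# Test environment: a disjoint block of levels starting past the training seeds.
# Setting num_levels=0 instead samples from the full level distribution.
test_env = gym.make(
    "procgen:procgen-coinrun-v0",
    num_levels=1000,
    start_level=10,
    distribution_mode="easy",
)

print(train_env.observation_space, train_env.action_space)
```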