The Difficulty of Passive Learning in Deep Reinforcement Learning

Authors: Georg Ostrovski, Pablo Samuel Castro, Will Dabney

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the vein of Held & Hein's classic 1963 experiment, we propose the tandem learning experimental paradigm which facilitates our empirical analysis of the difficulties in offline reinforcement learning. We identify function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work. Our results provide relevant insights for offline deep reinforcement learning, while also shedding new light on phenomena observed in the online case of learning control. (A sketch of the tandem update step follows the table.)
Researcher Affiliation | Industry | Georg Ostrovski, DeepMind, ostrovski@deepmind.com; Pablo Samuel Castro, Google Research, Brain Team, psc@google.com; Will Dabney, DeepMind, wdabney@deepmind.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide two Tandem RL implementations: https://github.com/deepmind/deepmind-research/tree/master/tandem_dqn based on the DQN Zoo [Quan and Ostrovski, 2020], and https://github.com/google/dopamine/tree/master/dopamine/labs/tandem_dqn based on the Dopamine library [Castro et al., 2018].
Open Datasets | Yes | Most of our experiments are performed on the Atari domain [Bellemare et al., 2013], using the exact algorithm and hyperparameters from [van Hasselt et al., 2016]. We ascertain the generality of this finding by replicating it across a broad suite of environments and agent architectures: Double-DQN on 57 Atari environments (Appendix Figs. 10 & 11), adapted agent variants on four Classic Control domains from the OpenAI Gym library [Brockman et al., 2016] and the MinAtar domain [Young and Tian, 2019] (Appendix Figs. 12 & 15). (Illustrative environment construction for these suites appears after the table.)
Dataset Splits | No | The paper describes training and evaluation steps, but does not provide train/validation/test splits as percentages or sample counts; such splits are less common in online RL, where data is generated interactively.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions software components like 'DQN Zoo', 'Dopamine library', 'JAX', 'RLax', 'Haiku', and 'Optax', but does not provide specific version numbers for any of these dependencies. (A version-recording helper is sketched after the table.)
Experiment Setup | Yes | Most of our experiments are performed on the Atari domain [Bellemare et al., 2013], using the exact algorithm and hyperparameters from [van Hasselt et al., 2016]. Following the usual training protocol [Mnih et al., 2015], the total training budget is 200 iterations, each of which consists of 1M steps taken on the environment by the active agent, interspersed with regular learning updates (on one, or concurrently on both agents, depending on the paradigm), on batches of transitions sampled from the active agent's replay buffer. Both agents are independently evaluated on the environment for 500K steps after each training iteration. (The training/evaluation loop is sketched after the table.)
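
The abstract quoted in the "Research Type" row centers on the tandem paradigm: an active agent interacts with the environment and learns online, while a passive agent is trained on exactly the same sampled transitions but never acts. A minimal sketch of one such shared update step, using hypothetical class and function names rather than anything from the released tandem_dqn code, looks like this:

```python
# Hypothetical sketch of a single tandem update step: the active agent both
# generates data and learns from it, while the passive agent learns from the
# very same batches without ever acting. Class and function names here are
# placeholders, not the API of the released implementations.

import numpy as np


class QAgent:
    """Stand-in for a DQN-style agent with action selection and batch updates."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)

    def select_action(self, observation):
        # Placeholder for epsilon-greedy selection over learned Q-values.
        return int(self.rng.integers(self.num_actions))

    def update(self, batch):
        # A real agent would compute a TD loss on the batch and take a
        # gradient step; omitted here.
        pass


def tandem_update(active_agent, passive_agent, replay_buffer, batch_size=32):
    """Update both agents on one batch drawn from the active agent's buffer."""
    batch = replay_buffer.sample(batch_size)
    active_agent.update(batch)   # online learner: produces and consumes the data
    passive_agent.update(batch)  # passive learner: consumes the same data only
    return batch
```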
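
The "Open Datasets" row names three environment suites: 57 Atari games, four Classic Control tasks from OpenAI Gym, and MinAtar. The snippet below shows one plausible way to instantiate a representative of each with the standard gym and minatar packages; the particular environment IDs are illustrative assumptions, not the paper's exact selection.

```python
# Illustrative construction of one environment per suite. The IDs chosen here
# are examples; the paper's experiments span 57 Atari games, four Classic
# Control tasks, and the MinAtar games.

import gym                       # OpenAI Gym: Atari (with atari extras) and Classic Control
from minatar import Environment  # MinAtar package [Young and Tian, 2019]

atari_env = gym.make("PongNoFrameskip-v4")  # requires the Atari ROM / ale-py extras
classic_env = gym.make("CartPole-v1")       # one of the Classic Control domains
minatar_env = Environment("breakout")       # MinAtar uses its own Environment class
```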
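
Since the "Software Dependencies" row notes that no version numbers are given, anyone attempting a reproduction may want to record the versions they actually install. A small standard-library helper, with assumed PyPI package names for the libraries the paper mentions, is sketched below:

```python
# Record installed versions of the libraries named in the paper. The PyPI
# names below (e.g. "dm-haiku", "dopamine-rl") are assumptions about how the
# mentioned components are distributed; adjust as needed.

from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["jax", "rlax", "dm-haiku", "optax", "dopamine-rl"]

for pkg in PACKAGES:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```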
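
The "Experiment Setup" row fixes the outer protocol: 200 training iterations, each of 1M active-agent environment steps interleaved with learning updates, followed by an independent 500K-step evaluation of each agent. The loop below restates that schedule; run_training_iteration and evaluate are hypothetical helpers, not functions from the released implementations.

```python
# Outer training/evaluation schedule from the "Experiment Setup" row. The
# constants match the reported protocol; the injected callables are
# placeholders for environment interaction and evaluation.

NUM_ITERATIONS = 200               # total training budget
TRAIN_STEPS_PER_ITER = 1_000_000   # active-agent environment steps per iteration
EVAL_STEPS = 500_000               # evaluation steps per agent, per iteration


def run_tandem_experiment(active_agent, passive_agent, env, replay_buffer,
                          run_training_iteration, evaluate):
    """Drive the tandem protocol with hypothetical, injected helpers."""
    for iteration in range(NUM_ITERATIONS):
        # 1M environment steps by the active agent, interspersed with tandem
        # updates on batches sampled from its replay buffer.
        run_training_iteration(active_agent, passive_agent, env, replay_buffer,
                               num_steps=TRAIN_STEPS_PER_ITER)

        # Both agents are evaluated independently after every iteration.
        active_score = evaluate(active_agent, env, num_steps=EVAL_STEPS)
        passive_score = evaluate(passive_agent, env, num_steps=EVAL_STEPS)
        print(f"iteration {iteration}: "
              f"active={active_score:.1f}, passive={passive_score:.1f}")
```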