The Difficulty of Passive Learning in Deep Reinforcement Learning
Authors: Georg Ostrovski, Pablo Samuel Castro, Will Dabney
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the vein of Held & Hein's classic 1963 experiment, we propose the tandem learning experimental paradigm which facilitates our empirical analysis of the difficulties in offline reinforcement learning. We identify function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work. Our results provide relevant insights for offline deep reinforcement learning, while also shedding new light on phenomena observed in the online case of learning control. |
| Researcher Affiliation | Industry | Georg Ostrovski, DeepMind, ostrovski@deepmind.com; Pablo Samuel Castro, Google Research, Brain Team, psc@google.com; Will Dabney, DeepMind, wdabney@deepmind.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We provide two Tandem RL implementations: https://github.com/deepmind/deepmind-research/tree/master/tandem_dqn based on the DQN Zoo [Quan and Ostrovski, 2020], and https://github.com/google/dopamine/tree/master/dopamine/labs/tandem_dqn based on the Dopamine library [Castro et al., 2018]. |
| Open Datasets | Yes | Most of our experiments are performed on the Atari domain [Bellemare et al., 2013], using the exact algorithm and hyperparameters from [van Hasselt et al., 2016]. We ascertain the generality of this finding by replicating it across a broad suite of environments and agent architectures: Double-DQN on 57 Atari environments (Appendix Figs. 10 & 11), adapted agent variants on four Classic Control domains from the OpenAI Gym library [Brockman et al., 2016] and the MinAtar domain [Young and Tian, 2019] (Appendix Figs. 12 & 15). |
| Dataset Splits | No | The paper describes training and evaluation steps but does not report train/validation/test splits as percentages or sample counts; such splits are uncommon in online RL, where data is generated interactively. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments. |
| Software Dependencies | No | The paper mentions software components like 'DQN Zoo', 'Dopamine library', 'JAX', 'RLax', 'Haiku', and 'Optax', but does not provide specific version numbers for any of these dependencies. |
| Experiment Setup | Yes | Most of our experiments are performed on the Atari domain [Bellemare et al., 2013], using the exact algorithm and hyperparameters from [van Hasselt et al., 2016]. Following the usual training protocol [Mnih et al., 2015], the total training budget is 200 iterations, each of which consists of 1M steps taken on the environment by the active agent, interspersed with regular learning updates (on one, or concurrently on both agents, depending on the paradigm), on batches of transitions sampled from the active agent's replay buffer. Both agents are independently evaluated on the environment for 500K steps after each training iteration. (An illustrative sketch of this tandem training loop follows the table.) |
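
The experiment-setup row above describes the tandem protocol only in prose. The Python sketch below illustrates that loop under stated assumptions; it is not the released DQN Zoo or Dopamine implementation. All identifiers (`active`, `passive`, `replay`, `evaluate`) and the `UPDATE_PERIOD` and `BATCH_SIZE` values are hypothetical placeholders chosen to reflect typical DQN settings, while the iteration counts follow the numbers quoted from the paper.

```python
# Minimal sketch of the tandem training protocol described in the table.
# Only the active agent acts and generates data; the passive agent learns
# from exactly the same sampled batches but never influences data collection.

NUM_ITERATIONS = 200            # total training budget (from the paper)
STEPS_PER_ITERATION = 1_000_000 # environment steps per iteration (from the paper)
EVAL_STEPS = 500_000            # evaluation steps per agent per iteration (from the paper)
UPDATE_PERIOD = 4               # assumed typical DQN update period
BATCH_SIZE = 32                 # assumed typical DQN batch size


def run_tandem(env, active, passive, replay, evaluate):
    """Run the tandem loop with hypothetical agent/replay/eval interfaces."""
    obs = env.reset()
    for iteration in range(NUM_ITERATIONS):
        for step in range(STEPS_PER_ITERATION):
            # The active agent selects all actions taken on the environment.
            action = active.select_action(obs)
            next_obs, reward, done, _ = env.step(action)
            replay.add(obs, action, reward, next_obs, done)
            obs = env.reset() if done else next_obs

            # Both agents are updated on the same batches sampled from the
            # active agent's replay buffer.
            if step % UPDATE_PERIOD == 0 and replay.size() >= BATCH_SIZE:
                batch = replay.sample(BATCH_SIZE)
                active.update(batch)    # online (active) learner
                passive.update(batch)   # passive (tandem) learner

        # Both agents are evaluated independently after each iteration.
        active_return = evaluate(env, active, EVAL_STEPS)
        passive_return = evaluate(env, passive, EVAL_STEPS)
        print(f"iteration {iteration}: "
              f"active={active_return:.1f}, passive={passive_return:.1f}")
```

The key design point the sketch emphasizes is that the active and passive learners share data and updates but differ only in whether their learning feeds back into behaviour, which is the contrast the tandem paradigm is built to isolate.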