The Difficulty of Passive Learning in Deep Reinforcement Learning

Authors: Georg Ostrovski, Pablo Samuel Castro, Will Dabney

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In the vein of Held & Hein's classic 1963 experiment, we propose the tandem learning experimental paradigm which facilitates our empirical analysis of the difficulties in offline reinforcement learning. We identify function approximation in conjunction with fixed data distributions as the strongest factors, thereby extending but also challenging hypotheses stated in past work. Our results provide relevant insights for offline deep reinforcement learning, while also shedding new light on phenomena observed in the online case of learning control. (A sketch of the tandem update step follows the table.)
Researcher Affiliation | Industry | Georg Ostrovski, DeepMind, ostrovski@deepmind.com; Pablo Samuel Castro, Google Research, Brain Team, psc@google.com; Will Dabney, DeepMind, wdabney@deepmind.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide two Tandem RL implementations: https://github.com/deepmind/deepmind-research/tree/master/tandem_dqn based on the DQN Zoo [Quan and Ostrovski, 2020], and https://github.com/google/dopamine/tree/master/dopamine/labs/tandem_dqn based on the Dopamine library [Castro et al., 2018].
Open Datasets | Yes | Most of our experiments are performed on the Atari domain [Bellemare et al., 2013], using the exact algorithm and hyperparameters from [van Hasselt et al., 2016]. We ascertain the generality of this finding by replicating it across a broad suite of environments and agent architectures: Double-DQN on 57 Atari environments (Appendix Figs. 10 & 11), adapted agent variants on four Classic Control domains from the OpenAI Gym library [Brockman et al., 2016] and the MinAtar domain [Young and Tian, 2019] (Appendix Figs. 12 & 15). (Illustrative environment construction for these suites appears after the table.)
Dataset Splits | No | The paper describes training and evaluation steps, but does not provide train/validation/test splits as percentages or sample counts; such splits are less common in online RL, where data is generated interactively.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory) used to run its experiments.
Software Dependencies | No | The paper mentions software components like 'DQN Zoo', 'Dopamine library', 'JAX', 'RLax', 'Haiku', and 'Optax', but does not provide specific version numbers for any of these dependencies. (A version-recording helper is sketched after the table.)
Experiment Setup | Yes | Most of our experiments are performed on the Atari domain [Bellemare et al., 2013], using the exact algorithm and hyperparameters from [van Hasselt et al., 2016]. Following the usual training protocol [Mnih et al., 2015], the total training budget is 200 iterations, each of which consists of 1M steps taken on the environment by the active agent, interspersed with regular learning updates (on one, or concurrently on both agents, depending on the paradigm), on batches of transitions sampled from the active agent's replay buffer. Both agents are independently evaluated on the environment for 500K steps after each training iteration. (The training/evaluation loop is sketched after the table.)
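
The abstract quoted in the "Research Type" row centers on the tandem paradigm: an active agent interacts with the environment and learns online, while a passive agent is trained on exactly the same sampled transitions but never acts. A minimal sketch of one such shared update step, using hypothetical class and function names rather than anything from the released tandem_dqn code, looks like this:

```python
# Hypothetical sketch of a single tandem update step: the active agent both
# generates data and learns from it, while the passive agent learns from the
# very same batches without ever acting. Class and function names here are
# placeholders, not the API of the released implementations.

import numpy as np


class QAgent:
    """Stand-in for a DQN-style agent with action selection and batch updates."""

    def __init__(self, num_actions, seed=0):
        self.num_actions = num_actions
        self.rng = np.random.default_rng(seed)

    def select_action(self, observation):
        # Placeholder for epsilon-greedy selection over learned Q-values.
        return int(self.rng.integers(self.num_actions))

    def update(self, batch):
        # A real agent would compute a TD loss on the batch and take a
        # gradient step; omitted here.
        pass


def tandem_update(active_agent, passive_agent, replay_buffer, batch_size=32):
    """Update both agents on one batch drawn from the active agent's buffer."""
    batch = replay_buffer.sample(batch_size)
    active_agent.update(batch)   # online learner: produces and consumes the data
    passive_agent.update(batch)  # passive learner: consumes the same data only
    return batch
```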
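
The "Open Datasets" row names three environment suites: 57 Atari games, four Classic Control tasks from OpenAI Gym, and MinAtar. The snippet below shows one plausible way to instantiate a representative of each with the standard gym and minatar packages; the particular environment IDs are illustrative assumptions, not the paper's exact selection.

```python
# Illustrative construction of one environment per suite. The IDs chosen here
# are examples; the paper's experiments span 57 Atari games, four Classic
# Control tasks, and the MinAtar games.

import gym                       # OpenAI Gym: Atari (with atari extras) and Classic Control
from minatar import Environment  # MinAtar package [Young and Tian, 2019]

atari_env = gym.make("PongNoFrameskip-v4")  # requires the Atari ROM / ale-py extras
classic_env = gym.make("CartPole-v1")       # one of the Classic Control domains
minatar_env = Environment("breakout")       # MinAtar uses its own Environment class
```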
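
Since the "Software Dependencies" row notes that no version numbers are given, anyone attempting a reproduction may want to record the versions they actually install. A small standard-library helper, with assumed PyPI package names for the libraries the paper mentions, is sketched below:

```python
# Record installed versions of the libraries named in the paper. The PyPI
# names below (e.g. "dm-haiku", "dopamine-rl") are assumptions about how the
# mentioned components are distributed; adjust as needed.

from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["jax", "rlax", "dm-haiku", "optax", "dopamine-rl"]

for pkg in PACKAGES:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```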
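
The "Experiment Setup" row fixes the outer protocol: 200 training iterations, each of 1M active-agent environment steps interleaved with learning updates, followed by an independent 500K-step evaluation of each agent. The loop below restates that schedule; run_training_iteration and evaluate are hypothetical helpers, not functions from the released implementations.

```python
# Outer training/evaluation schedule from the "Experiment Setup" row. The
# constants match the reported protocol; the injected callables are
# placeholders for environment interaction and evaluation.

NUM_ITERATIONS = 200               # total training budget
TRAIN_STEPS_PER_ITER = 1_000_000   # active-agent environment steps per iteration
EVAL_STEPS = 500_000               # evaluation steps per agent, per iteration


def run_tandem_experiment(active_agent, passive_agent, env, replay_buffer,
                          run_training_iteration, evaluate):
    """Drive the tandem protocol with hypothetical, injected helpers."""
    for iteration in range(NUM_ITERATIONS):
        # 1M environment steps by the active agent, interspersed with tandem
        # updates on batches sampled from its replay buffer.
        run_training_iteration(active_agent, passive_agent, env, replay_buffer,
                               num_steps=TRAIN_STEPS_PER_ITER)

        # Both agents are evaluated independently after every iteration.
        active_score = evaluate(active_agent, env, num_steps=EVAL_STEPS)
        passive_score = evaluate(passive_agent, env, num_steps=EVAL_STEPS)
        print(f"iteration {iteration}: "
              f"active={active_score:.1f}, passive={passive_score:.1f}")
```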