Learning more skills through optimistic exploration

Authors: DJ Strouse, Kate Baumli, David Warde-Farley, Volodymyr Mnih, Steven Stenberg Hansen

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels).
Researcher Affiliation | Industry | DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, Steven Hansen; DeepMind; {strouse, baumli, dwf, vmnih, stevenhansen}@google.com
Pseudocode | Yes | Pseudocode for DISDAIN is provided in Algorithm 1. (A hedged sketch of the DISDAIN bonus appears after this table.)
Open Source Code | Yes | An open source reimplementation of DISDAIN on a smaller version of Four Rooms is available at http://github.com/deepmind/disdain.
Open Datasets | Yes | We validate DISDAIN by testing its ability to increase skill learning in an illustrative grid world (Four Rooms) as well as a more challenging pixel-based setting requiring function approximation (the 57 Atari games of the Arcade Learning Environment (Bellemare et al., 2013)).
Dataset Splits | No | The paper mentions using a 'distributed actor-learner setup' and discusses training and hyperparameters, but it does not specify explicit training, validation, and test splits for the datasets used in its experiments.
Hardware Specification | Yes | Our distributed reinforcement learning setup (Espeholt et al., 2018) used 100 CPU actors and a single V100 GPU learner.
Software Dependencies | No | Table 1 lists general software components like 'Adam' and 'Q(λ)' with citations to their respective papers, but it does not provide specific version numbers for these or other software libraries (e.g., Python or PyTorch versions) needed for reproduction.
Experiment Setup | Yes | Hyperparameters reported per domain, listed as Atari | Four Rooms ("same" means same as Atari; "n/a" marks values reported for one domain only). These settings are also collected into an illustrative config sketch after this table.
  Torso: IMPALA torso (Espeholt et al., 2018) | tabular
  Head hidden size: 256 | n/a
  Number of actors: 100 | 64
  Batch size: 128 | 16
  Skill trajectory length (T): 20 | same
  Unroll length: 20 | same
  Actor update period: 100 | same
  Number of skill latents (N_Z): 64 | 128
  Replay buffer size: 10^6 unrolls | same
  Optimizer: Adam (Kingma and Ba, 2015) | SGD
  Learning rate: 2e-4 | 2e-3
  Adam ε: 1e-3 | n/a
  Adam β1: 0.0 | n/a
  Adam β2: 0.95 | n/a
  RL algorithm: Q(λ) (Peng and Williams, 1994) | same
  λ: 0.7 | same
  Discount γ: 0.99 | same
  Target update period: 100 | n/a
  DISDAIN ensemble size (N): 40 | 2
  DISDAIN reward weight (λ): 180.0 | 10.0
  RND reward weight: 0.3 | n/a
  Count bonus weight: n/a | 10.0
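
As a companion to the Pseudocode row above, the sketch below illustrates the quantity DISDAIN adds as an intrinsic reward: the entropy of the ensemble-averaged discriminator posterior over skills minus the average entropy of the individual ensemble members, so that states on which the ensemble disagrees receive a larger bonus. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation; the function name disdain_bonus, the logits-shaped input, and the 1e-8 stabilizer are illustrative choices.

    import numpy as np

    def disdain_bonus(ensemble_logits):
        """Ensemble-disagreement bonus from skill-discriminator logits.

        ensemble_logits: array of shape [N, B, Z] for N ensemble members,
        B states, and Z skill latents. Returns one bonus per state, shape [B].
        """
        # Softmax over the skill dimension for each ensemble member.
        shifted = ensemble_logits - ensemble_logits.max(axis=-1, keepdims=True)
        probs = np.exp(shifted)
        probs /= probs.sum(axis=-1, keepdims=True)                  # [N, B, Z]

        # Entropy of the averaged posterior minus the average member entropy.
        mean_probs = probs.mean(axis=0)                             # [B, Z]
        entropy_of_mean = -(mean_probs * np.log(mean_probs + 1e-8)).sum(axis=-1)
        mean_entropy = -(probs * np.log(probs + 1e-8)).sum(axis=-1).mean(axis=0)
        return entropy_of_mean - mean_entropy

    # Example: 40 ensemble members (the Atari setting), a batch of 8 states,
    # and 64 skill latents; the bonus is largest where members disagree most.
    logits = np.random.default_rng(0).normal(size=(40, 8, 64))
    bonus = disdain_bonus(logits)    # shape (8,)

In the paper, this bonus is scaled by the DISDAIN reward weight from the table above and added to the standard skill-discriminability reward before being passed to the RL algorithm.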
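
The hyperparameters quoted in the Experiment Setup row, collected here into plain Python dictionaries purely for readability. The key names and dictionary layout are assumptions for illustration, not the authors' configuration format; values listed as "same" in the paper's table are written out explicitly, and keys reported for only one domain are omitted from the other.

    ATARI_CONFIG = {
        "torso": "IMPALA torso (Espeholt et al., 2018)",
        "head_hidden_size": 256,
        "num_actors": 100,
        "batch_size": 128,
        "skill_trajectory_length": 20,
        "unroll_length": 20,
        "actor_update_period": 100,
        "num_skill_latents": 64,
        "replay_buffer_size_unrolls": 10**6,
        "optimizer": "Adam (Kingma and Ba, 2015)",
        "learning_rate": 2e-4,
        "adam_epsilon": 1e-3,
        "adam_beta1": 0.0,
        "adam_beta2": 0.95,
        "rl_algorithm": "Q(lambda) (Peng and Williams, 1994)",
        "q_lambda": 0.7,
        "discount": 0.99,
        "target_update_period": 100,
        "disdain_ensemble_size": 40,
        "disdain_reward_weight": 180.0,
        "rnd_reward_weight": 0.3,
    }

    FOUR_ROOMS_CONFIG = {
        "torso": "tabular",
        "num_actors": 64,
        "batch_size": 16,
        "skill_trajectory_length": 20,
        "unroll_length": 20,
        "actor_update_period": 100,
        "num_skill_latents": 128,
        "replay_buffer_size_unrolls": 10**6,
        "optimizer": "SGD",
        "learning_rate": 2e-3,
        "rl_algorithm": "Q(lambda) (Peng and Williams, 1994)",
        "q_lambda": 0.7,
        "discount": 0.99,
        "disdain_ensemble_size": 2,
        "disdain_reward_weight": 10.0,
        "count_bonus_weight": 10.0,
    }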