Learning more skills through optimistic exploration

Authors: DJ Strouse, Kate Baumli, David Warde-Farley, Volodymyr Mnih, Steven Stenberg Hansen

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate empirically that DISDAIN improves skill learning both in a tabular grid world (Four Rooms) and the 57 games of the Atari Suite (from pixels).
Researcher Affiliation | Industry | DJ Strouse, Kate Baumli, David Warde-Farley, Vlad Mnih, Steven Hansen; DeepMind; {strouse, baumli, dwf, vmnih, stevenhansen}@google.com
Pseudocode | Yes | Pseudocode for DISDAIN is provided in Algorithm 1. (A hedged sketch of the DISDAIN bonus appears after this table.)
Open Source Code | Yes | An open source reimplementation of DISDAIN on a smaller version of Four Rooms is available at http://github.com/deepmind/disdain.
Open Datasets | Yes | We validate DISDAIN by testing its ability to increase skill learning in an illustrative grid world (Four Rooms) as well as a more challenging pixel-based setting requiring function approximation (the 57 Atari games of the Arcade Learning Environment (Bellemare et al., 2013)).
Dataset Splits | No | The paper mentions using a 'distributed actor-learner setup' and discusses training and hyperparameters, but it does not specify explicit training, validation, and test splits for the datasets used in its experiments.
Hardware Specification | Yes | Our distributed reinforcement learning setup (Espeholt et al., 2018) used 100 CPU actors and a single V100 GPU learner.
Software Dependencies | No | Table 1 lists general software components like 'Adam' and 'Q(λ)' with citations to their respective papers, but it does not provide specific version numbers for these or other software libraries (e.g., Python or PyTorch versions) needed for reproduction.
Experiment Setup | Yes | Hyperparameters reported per domain, listed as Atari | Four Rooms ("same" means same as Atari; "n/a" marks values reported for one domain only). These settings are also collected into an illustrative config sketch after this table.
  Torso: IMPALA torso (Espeholt et al., 2018) | tabular
  Head hidden size: 256 | n/a
  Number of actors: 100 | 64
  Batch size: 128 | 16
  Skill trajectory length (T): 20 | same
  Unroll length: 20 | same
  Actor update period: 100 | same
  Number of skill latents (N_Z): 64 | 128
  Replay buffer size: 10^6 unrolls | same
  Optimizer: Adam (Kingma and Ba, 2015) | SGD
  Learning rate: 2e-4 | 2e-3
  Adam ε: 1e-3 | n/a
  Adam β1: 0.0 | n/a
  Adam β2: 0.95 | n/a
  RL algorithm: Q(λ) (Peng and Williams, 1994) | same
  λ: 0.7 | same
  Discount γ: 0.99 | same
  Target update period: 100 | n/a
  DISDAIN ensemble size (N): 40 | 2
  DISDAIN reward weight (λ): 180.0 | 10.0
  RND reward weight: 0.3 | n/a
  Count bonus weight: n/a | 10.0
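
As a companion to the Pseudocode row above, the sketch below illustrates the quantity DISDAIN adds as an intrinsic reward: the entropy of the ensemble-averaged discriminator posterior over skills minus the average entropy of the individual ensemble members, so that states on which the ensemble disagrees receive a larger bonus. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation; the function name disdain_bonus, the logits-shaped input, and the 1e-8 stabilizer are illustrative choices.

    import numpy as np

    def disdain_bonus(ensemble_logits):
        """Ensemble-disagreement bonus from skill-discriminator logits.

        ensemble_logits: array of shape [N, B, Z] for N ensemble members,
        B states, and Z skill latents. Returns one bonus per state, shape [B].
        """
        # Softmax over the skill dimension for each ensemble member.
        shifted = ensemble_logits - ensemble_logits.max(axis=-1, keepdims=True)
        probs = np.exp(shifted)
        probs /= probs.sum(axis=-1, keepdims=True)                  # [N, B, Z]

        # Entropy of the averaged posterior minus the average member entropy.
        mean_probs = probs.mean(axis=0)                             # [B, Z]
        entropy_of_mean = -(mean_probs * np.log(mean_probs + 1e-8)).sum(axis=-1)
        mean_entropy = -(probs * np.log(probs + 1e-8)).sum(axis=-1).mean(axis=0)
        return entropy_of_mean - mean_entropy

    # Example: 40 ensemble members (the Atari setting), a batch of 8 states,
    # and 64 skill latents; the bonus is largest where members disagree most.
    logits = np.random.default_rng(0).normal(size=(40, 8, 64))
    bonus = disdain_bonus(logits)    # shape (8,)

In the paper, this bonus is scaled by the DISDAIN reward weight from the table above and added to the standard skill-discriminability reward before being passed to the RL algorithm.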
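
The hyperparameters quoted in the Experiment Setup row, collected here into plain Python dictionaries purely for readability. The key names and dictionary layout are assumptions for illustration, not the authors' configuration format; values listed as "same" in the paper's table are written out explicitly, and keys reported for only one domain are omitted from the other.

    ATARI_CONFIG = {
        "torso": "IMPALA torso (Espeholt et al., 2018)",
        "head_hidden_size": 256,
        "num_actors": 100,
        "batch_size": 128,
        "skill_trajectory_length": 20,
        "unroll_length": 20,
        "actor_update_period": 100,
        "num_skill_latents": 64,
        "replay_buffer_size_unrolls": 10**6,
        "optimizer": "Adam (Kingma and Ba, 2015)",
        "learning_rate": 2e-4,
        "adam_epsilon": 1e-3,
        "adam_beta1": 0.0,
        "adam_beta2": 0.95,
        "rl_algorithm": "Q(lambda) (Peng and Williams, 1994)",
        "q_lambda": 0.7,
        "discount": 0.99,
        "target_update_period": 100,
        "disdain_ensemble_size": 40,
        "disdain_reward_weight": 180.0,
        "rnd_reward_weight": 0.3,
    }

    FOUR_ROOMS_CONFIG = {
        "torso": "tabular",
        "num_actors": 64,
        "batch_size": 16,
        "skill_trajectory_length": 20,
        "unroll_length": 20,
        "actor_update_period": 100,
        "num_skill_latents": 128,
        "replay_buffer_size_unrolls": 10**6,
        "optimizer": "SGD",
        "learning_rate": 2e-3,
        "rl_algorithm": "Q(lambda) (Peng and Williams, 1994)",
        "q_lambda": 0.7,
        "discount": 0.99,
        "disdain_ensemble_size": 2,
        "disdain_reward_weight": 10.0,
        "count_bonus_weight": 10.0,
    }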