Entropic Desired Dynamics for Intrinsic Control

Authors: Steven Hansen, Guillaume Desjardins, Kate Baumli, David Warde-Farley, Nicolas Heess, Simon Osindero, Volodymyr Mnih

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here we evaluate EDDICT's learned representations and behavior, and contrast them to prior work in the space of intrinsic control (or skill discovery) methods. We assess the learned representations qualitatively by looking at how well they correspond to privileged information known to be relevant to downstream tasks. Namely, the state dimensions given in the DeepMind Control Suite [43] and the avatar coordinates in the Arcade Learning Environment (ALE) [9]. We stress that this privileged information is not used during training in any way, with reverse predictors operating on the same input as the Q-function. Table 1: Results on 6 Atari games at 1B frames.
Researcher Affiliation | Industry | Steven Hansen (DeepMind), Guillaume Desjardins (DeepMind), Kate Baumli (DeepMind), David Warde-Farley (DeepMind), Nicolas Heess (DeepMind), Simon Osindero (DeepMind), Volodymyr Mnih (DeepMind). Correspondence to stevenhansen@deepmind.com.
Pseudocode | Yes | Algorithm 1: EDDICT
Open Source Code | No | The paper mentions "All algorithms were implemented in the same codebase" but does not provide concrete access (a link or an explicit statement of release) to the source code for the described methodology.
Open Datasets | Yes | Namely, the state dimensions given in the DeepMind Control Suite [43] and the avatar coordinates in the Arcade Learning Environment (ALE) [9].
Dataset Splits | No | The paper mentions training models and using a replay buffer, but it does not provide specific details about train/validation/test splits (e.g., percentages, sample counts, or references to predefined splits).
Hardware Specification | No | The paper mentions using a "distributed deep reinforcement learning system" but does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types.
Software Dependencies | No | The paper mentions algorithms and architectures like "Peng's Q(λ)", "ϵ-greedy policies", and "ResNet", but does not specify any software libraries or dependencies with version numbers.
Experiment Setup | Yes | In practice, we optimize the above objective using a value-based reinforcement learning algorithm and ϵ-greedy policies (in lieu of a Boltzmann policy), and thus omit these terms. Concretely, this can be implemented by treating each option period as a pseudo-episode, i.e. using discount factors which are zero on option boundaries, as shown in Algorithm 1. We parameterize the action-value function Qθ(s, a, z) as an MLP operating on state embeddings, derived from a ResNet [24], and linear action and code embeddings.
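
The Experiment Setup row above quotes two concrete implementation details: discount factors zeroed on option boundaries (so each option period is bootstrapped as a pseudo-episode) and a Qθ(s, a, z) head built from a ResNet state embedding plus linear action and code embeddings. The sketch below is not the authors' code; it is a minimal PyTorch-style reconstruction assuming discrete actions and codes, an externally supplied ResNet encoder that outputs embed_dim features, and illustrative sizes (embed_dim, hidden) that the paper does not specify.

```python
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Sketch of Q(s, a, z): an MLP over a ResNet state embedding and
    linear (lookup-table) action and code embeddings."""

    def __init__(self, state_encoder: nn.Module, embed_dim: int,
                 num_actions: int, num_codes: int, hidden: int = 512):
        super().__init__()
        self.state_encoder = state_encoder                        # ResNet torso, assumed to output embed_dim features
        self.action_embed = nn.Embedding(num_actions, embed_dim)  # linear action embedding
        self.code_embed = nn.Embedding(num_codes, embed_dim)      # linear code embedding
        self.mlp = nn.Sequential(
            nn.Linear(3 * embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                                 # scalar Q-value
        )

    def forward(self, obs, action, z):
        s = self.state_encoder(obs)                               # state embedding, shape [B, embed_dim]
        h = torch.cat([s, self.action_embed(action), self.code_embed(z)], dim=-1)
        return self.mlp(h).squeeze(-1)                            # Q(s, a, z), shape [B]


def pseudo_episode_discounts(option_boundary, gamma: float = 0.99):
    """Discounts for value bootstrapping: gamma within an option period, zero at
    option boundaries, so targets never bootstrap across a change of latent code."""
    return gamma * (1.0 - option_boundary.float())
```

Zeroing the discount at option boundaries means the value-learning targets (e.g. the Peng's Q(λ) returns mentioned above) treat each option period as its own episode, which is the pseudo-episode construction the row attributes to Algorithm 1.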