Unsupervised Control Through Non-Parametric Discriminative Rewards

Authors: David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, Volodymyr Mnih

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the efficacy of our agent to learn, in an unsupervised manner, to reach a diverse set of goals on three domains: Atari, the DeepMind Control Suite and DeepMind Lab. We evaluate, both qualitatively and quantitatively, the ability of DISCERN to achieve visually-specified goals in three diverse domains: the Arcade Learning Environment (Bellemare et al., 2013), continuous control tasks in the DeepMind Control Suite (Tassa et al., 2018), and DeepMind Lab (Beattie et al., 2016). We compared DISCERN to several baseline methods for learning goal-conditioned policies.
Researcher Affiliation | Academia | "Anonymous authors. Paper under double-blind review." No author affiliations are provided in the paper.
Pseudocode | Yes | Pseudocode for the DISCERN algorithm, decomposed into an experience-gathering (possibly distributed) actor process and a centralized learner process, is given in Algorithm 1.
Open Source Code | No | The paper provides a link to videos of the goal-conditioned policies ('https://sites.google.com/view/discern-anonymous/home') but does not state that the source code for the methodology is available or provide a link to a code repository.
Open Datasets | Yes | We evaluate, both qualitatively and quantitatively, the ability of DISCERN to achieve visually-specified goals in three diverse domains: the Arcade Learning Environment (Bellemare et al., 2013), continuous control tasks in the DeepMind Control Suite (Tassa et al., 2018), and DeepMind Lab (Beattie et al., 2016).
Dataset Splits | No | The paper describes how goals are sampled and used for training and evaluation, but it does not specify explicit train/validation/test dataset splits in the traditional sense of a fixed dataset (e.g., percentages or counts for each split).
Hardware Specification | No | The paper mentions a 'centralized GPU learner' and 'CPU-based parallel actors' but does not specify any particular models or detailed specifications of the hardware used for experiments.
Software Dependencies | No | The paper mentions using RMSProp and preprocessing protocols from other works, but it does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python version, TensorFlow/PyTorch version).
Experiment Setup | Yes | The following hyper-parameters were used in all of the experiments described in Section 6. All weight matrices are initialized using a standard truncated normal initializer, with the standard deviation inversely proportional to the square root of the fan-in. We maintain a goal buffer of size 1024 and use p_replace = 10^-3. We also use p_add non-diverse = 10^-3. For the teacher, we choose ξ_φ(·) to be an L2-normalized single layer of 32 tanh units, trained in all experiments with 4 decoys (and thus, according to our heuristic, β equal to 5). For hindsight experience replay, a hindsight goal is substituted 25% of the time. These goals are chosen uniformly at random from the last 3 frames of the trajectory. Trajectories were set to be 50 steps long for Atari and DeepMind Lab and 100 for the DeepMind Control Suite. We train the agent and teacher jointly with RMSProp (Tieleman & Hinton, 2012) with a learning rate of 10^-4.
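The Experiment Setup row maps onto a small configuration surface, so a sketch can make the reported values concrete. The Python snippet below collects the hyper-parameters quoted above into a config object and illustrates the stated hindsight-relabelling rule (substitute a hindsight goal 25% of the time, drawn uniformly from the last 3 frames of the trajectory). This is a minimal sketch under assumptions: the names (DiscernConfig, maybe_relabel_goal) and the trajectory representation are illustrative, not identifiers from the paper or its (unreleased) code.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DiscernConfig:
    """Hyper-parameters quoted in the Experiment Setup row; field names are illustrative."""
    goal_buffer_size: int = 1024
    p_replace: float = 1e-3             # probability of replacing a buffered goal
    p_add_non_diverse: float = 1e-3     # probability of adding a non-diverse goal
    teacher_embedding_units: int = 32   # single L2-normalized tanh layer
    num_decoys: int = 4                 # decoys used to train the teacher
    her_substitution_prob: float = 0.25
    her_last_k_frames: int = 3
    trajectory_length_atari_dmlab: int = 50
    trajectory_length_control_suite: int = 100
    learning_rate: float = 1e-4         # RMSProp, agent and teacher trained jointly

    @property
    def beta(self) -> int:
        # "according to our heuristic, beta equal to 5" when 4 decoys are used.
        return self.num_decoys + 1


def maybe_relabel_goal(trajectory: List, goal, cfg: DiscernConfig,
                       rng: random.Random = random.Random()) -> Tuple[object, bool]:
    """Hindsight relabelling as described above: 25% of the time, replace the
    commanded goal with a frame drawn uniformly at random from the last 3 frames
    of the trajectory. Returns (possibly substituted goal, whether relabelled)."""
    if rng.random() < cfg.her_substitution_prob:
        candidates = trajectory[-cfg.her_last_k_frames:]
        return rng.choice(candidates), True
    return goal, False
```

A full reproduction would additionally need the goal-buffer maintenance scheme, the teacher's decoy-based discriminative loss, and the actor/learner decomposition of Algorithm 1, which this row names but does not specify in enough detail to sketch faithfully.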