Unsupervised Control Through Non-Parametric Discriminative Rewards
Authors: David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, Volodymyr Mnih
ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our agent to learn, in an unsupervised manner, to reach a diverse set of goals on three domains: Atari, the DeepMind Control Suite and DeepMind Lab. We evaluate, both qualitatively and quantitatively, the ability of DISCERN to achieve visually-specified goals in three diverse domains: the Arcade Learning Environment (Bellemare et al., 2013), continuous control tasks in the DeepMind Control Suite (Tassa et al., 2018), and DeepMind Lab (Beattie et al., 2016). We compared DISCERN to several baseline methods for learning goal-conditioned policies. |
| Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review. No author affiliations are provided in the paper. |
| Pseudocode | Yes | Pseudocode for the DISCERN algorithm, decomposed into an experience-gathering (possibly distributed) actor process and a centralized learner process, is given in Algorithm 1. (An illustrative sketch of this actor/learner split appears below the table.) |
| Open Source Code | No | The paper provides a link to videos of the goal-conditioned policies ('https://sites.google.com/view/discern-anonymous/home') but does not state that the source code for the methodology is available or provide a link to a code repository. |
| Open Datasets | Yes | We evaluate, both qualitatively and quantitatively, the ability of DISCERN to achieve visually-specified goals in three diverse domains: the Arcade Learning Environment (Bellemare et al., 2013), continuous control tasks in the DeepMind Control Suite (Tassa et al., 2018), and DeepMind Lab (Beattie et al., 2016). |
| Dataset Splits | No | The paper describes how goals are sampled and used for training and evaluation, but it does not specify explicit train/validation/test dataset splits in the traditional sense of a fixed dataset (e.g., percentages or counts for each split). |
| Hardware Specification | No | The paper mentions a 'centralized GPU learner' and 'CPU-based parallel actors' but does not specify any particular models or detailed specifications of the hardware used for experiments. |
| Software Dependencies | No | The paper mentions using RMSProp and preprocessing protocols from other works, but it does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python version, TensorFlow/PyTorch version). |
| Experiment Setup | Yes | The following hyper-parameters were used in all of the experiments described in Section 6. All weight matrices are initialized using a standard truncated normal initializer, with the standard deviation inversely proportional to the square root of the fan-in. We maintain a goal buffer of size 1024 and use p_replace = 10^-3. We also use p_add non-diverse = 10^-3. For the teacher, we choose ξ_φ(·) to be an L2-normalized single layer of 32 tanh units, trained in all experiments with 4 decoys (and thus, according to our heuristic, β equal to 5). For hindsight experience replay, a hindsight goal is substituted 25% of the time. These goals are chosen uniformly at random from the last 3 frames of the trajectory. Trajectories were set to be 50 steps long for Atari and DeepMind Lab and 100 for the DeepMind Control Suite. We train the agent and teacher jointly with RMSProp (Tieleman & Hinton, 2012) with a learning rate of 10^-4. (These quoted values are collected in the configuration sketch below the table.) |
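
The Pseudocode row above points to Algorithm 1 of the paper, which splits DISCERN into experience-gathering actor processes and a centralized learner. The following is a minimal illustrative sketch of that split, not a reproduction of Algorithm 1; names such as `policy`, `teacher`, `goal_buffer`, and `replay` are assumed placeholders for components the paper describes but whose code is not released.

```python
# Hedged sketch of an actor/learner decomposition as described in the
# Pseudocode row. All object interfaces here are illustrative assumptions.

def actor_process(env, policy, goal_buffer, replay, trajectory_len=50):
    """One experience-gathering actor; several may run in parallel on CPU."""
    while True:
        goal = goal_buffer.sample()               # pick a visual goal to pursue
        obs, trajectory = env.reset(), []
        for _ in range(trajectory_len):
            action = policy.act(obs, goal)        # goal-conditioned policy
            obs, _, done, _ = env.step(action)
            trajectory.append((obs, action))
            if done:
                break
        goal_buffer.maybe_add(trajectory[-1][0])  # occasionally grow the goal set
        replay.push(trajectory, goal)

def learner_process(replay, policy, teacher, batch_size=32):
    """Centralized (e.g., GPU) learner updating the policy and the teacher."""
    while True:
        batch = replay.sample(batch_size)
        rewards = teacher.reward(batch)   # discriminative reward from the teacher
        policy.update(batch, rewards)     # off-policy goal-conditioned update
        teacher.update(batch)             # train goal embedding against decoys
```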
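The Experiment Setup row quotes several hyper-parameters and a 25% hindsight-goal substitution rule. Below is a hedged sketch that collects those quoted values into a configuration dictionary and implements the substitution rule; the key names and the helper `maybe_relabel_goal` are illustrative assumptions, not identifiers from the paper or its (unreleased) code.

```python
# Hyper-parameters as quoted in the Experiment Setup row; key names are assumed.
import random

CONFIG = {
    "goal_buffer_size": 1024,
    "p_replace": 1e-3,                  # probability of replacing a goal-buffer entry
    "p_add_non_diverse": 1e-3,          # probability of adding a non-diverse goal
    "teacher_embedding_units": 32,      # single L2-normalized tanh layer
    "num_decoys": 4,
    "beta": 5,                          # heuristic: num_decoys + 1
    "hindsight_prob": 0.25,             # fraction of goals substituted in hindsight
    "hindsight_last_k": 3,              # hindsight goals drawn from last 3 frames
    "trajectory_len_atari_dmlab": 50,
    "trajectory_len_control_suite": 100,
    "optimizer": "RMSProp",
    "learning_rate": 1e-4,
}

def maybe_relabel_goal(trajectory_frames, goal, cfg=CONFIG):
    """With probability 0.25, swap the goal for one of the trajectory's last 3 frames."""
    if random.random() < cfg["hindsight_prob"]:
        return random.choice(trajectory_frames[-cfg["hindsight_last_k"]:])
    return goal
```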