Planning to Explore via Self-Supervised World Models

Authors: Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak

ICML 2020

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods, and in fact, almost matches the performance of an oracle which has access to rewards." |
| Researcher Affiliation | Collaboration | 1 University of Pennsylvania, 2 UC Berkeley, 3 Google Research, Brain Team, 4 University of Toronto, 5 Carnegie Mellon University, 6 Facebook AI Research |
| Pseudocode | Yes | "Algorithm 1: Planning to Explore via Latent Disagreement" |
| Open Source Code | Yes | Videos and code: https://ramanans1.github.io/plan2explore/ |
| Open Datasets | Yes | "Environment Details: We use the DM Control Suite (Tassa et al., 2018), a standard benchmark for continuous control." |
| Dataset Splits | No | The paper describes exploration steps and adaptation phases (zero-shot, few-shot) but does not specify explicit train/validation/test splits with percentages or counts for reproducibility. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (GPU models, CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions using Dreamer (Hafner et al., 2020) as the base agent but does not provide version numbers for software dependencies such as Python, TensorFlow, or other libraries. |
| Experiment Setup | Yes | "We use (Hafner et al., 2020) with the original hyperparameters unless specified otherwise to optimize both exploration and task policies of Plan2Explore. We found that the additional capacity provided by increasing the hidden size of the GRU in the latent dynamics model to 400 and the deterministic and stochastic components of the latent space to 60 helped performance. For a fair comparison, we maintain this model size for Dreamer and other baselines. For latent disagreement, we use an ensemble of 5 one-step prediction models implemented as 2-hidden-layer MLPs. Full details are in the appendix." |
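The latent-disagreement objective described in the experiment setup can be sketched as follows: an ensemble of one-step prediction heads maps a latent state and action to a predicted next latent feature, and the variance across the ensemble's predictions serves as the intrinsic exploration reward. This is a minimal NumPy illustration, not the paper's implementation; the feature size, action size, MLP width, and the untrained random weights are assumptions for demonstration (the paper specifies only the ensemble size of 5 and the 2-hidden-layer MLP heads).

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only (the paper reports a latent space with
# 60-dim stochastic and 60-dim deterministic parts; action size varies by task).
FEAT = 120    # concatenated latent feature size (assumption)
ACT = 6       # action dimensionality (assumption)
HIDDEN = 200  # MLP hidden width (not specified in the excerpt; assumption)
ENSEMBLE = 5  # ensemble size, as stated in the setup

def init_mlp(in_dim, hidden, out_dim, rng):
    """Random (untrained) weights for a 2-hidden-layer MLP head."""
    return [
        (rng.normal(0.0, 0.1, (in_dim, hidden)), np.zeros(hidden)),
        (rng.normal(0.0, 0.1, (hidden, hidden)), np.zeros(hidden)),
        (rng.normal(0.0, 0.1, (hidden, out_dim)), np.zeros(out_dim)),
    ]

def mlp_forward(params, x):
    """Forward pass with ReLU on hidden layers, linear output."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

# Ensemble of one-step prediction models: (latent, action) -> next latent feature.
ensemble = [init_mlp(FEAT + ACT, HIDDEN, FEAT, rng) for _ in range(ENSEMBLE)]

def disagreement_reward(latent, action):
    """Intrinsic reward: ensemble variance of predicted next features,
    averaged over feature dimensions. High variance = high model uncertainty."""
    inp = np.concatenate([latent, action], axis=-1)
    preds = np.stack([mlp_forward(p, inp) for p in ensemble])  # (ENSEMBLE, FEAT)
    return preds.var(axis=0).mean()

r = disagreement_reward(rng.normal(size=FEAT), rng.normal(size=ACT))
```

In the actual method, each ensemble member is trained on a different bootstrap of the replay data, so the variance shrinks in well-explored regions of the latent space and the exploration policy is drawn toward states the world model has not yet learned.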