Planning to Explore via Self-Supervised World Models
Authors: Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on challenging control tasks from high-dimensional image inputs. Without any training supervision or task-specific interaction, Plan2Explore outperforms prior self-supervised exploration methods, and in fact, almost matches the performance of an oracle which has access to rewards. |
| Researcher Affiliation | Collaboration | University of Pennsylvania; UC Berkeley; Google Research, Brain Team; University of Toronto; Carnegie Mellon University; Facebook AI Research. |
| Pseudocode | Yes | Algorithm 1 Planning to Explore via Latent Disagreement |
| Open Source Code | Yes | Videos and code: https://ramanans1.github.io/plan2explore/ |
| Open Datasets | Yes | Environment Details We use the DM Control Suite (Tassa et al., 2018), a standard benchmark for continuous control. |
| Dataset Splits | No | The paper describes exploration steps and adaptation phases (zero-shot, few-shot) but does not specify explicit train/validation/test dataset splits with percentages or counts for reproducibility. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (GPU models, CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Dreamer (Hafner et al., 2020)' as the base agent but does not provide specific version numbers for software dependencies like Python, PyTorch, TensorFlow, or other libraries. |
| Experiment Setup | Yes | We use Dreamer (Hafner et al., 2020) with the original hyperparameters unless specified otherwise to optimize both the exploration and task policies of Plan2Explore. We found that the additional capacity provided by increasing the hidden size of the GRU in the latent dynamics model to 400 and the deterministic and stochastic components of the latent space to 60 helped performance. For a fair comparison, we maintain this model size for Dreamer and other baselines. For latent disagreement, we use an ensemble of 5 one-step prediction models, each implemented as a 2-hidden-layer MLP. Full details are in the appendix. |
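
The pseudocode row (Algorithm 1, "Planning to Explore via Latent Disagreement") and the experiment-setup row together describe an ensemble of 5 one-step prediction models, each a 2-hidden-layer MLP, whose disagreement serves as the intrinsic exploration reward. The sketch below is a minimal PyTorch illustration of that disagreement signal, not the authors' released implementation; the class names, hidden sizes, and latent/action dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn


class OneStepModel(nn.Module):
    """One ensemble member: a 2-hidden-layer MLP predicting the next
    latent features from the current latent state and action."""
    def __init__(self, latent_dim, action_dim, hidden=400):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, latent, action):
        return self.net(torch.cat([latent, action], dim=-1))


class LatentDisagreement(nn.Module):
    """Ensemble of K one-step models; the intrinsic reward is the variance
    of their predictions across members (the latent disagreement)."""
    def __init__(self, latent_dim, action_dim, k=5):
        super().__init__()
        self.members = nn.ModuleList(
            OneStepModel(latent_dim, action_dim) for _ in range(k))

    def intrinsic_reward(self, latent, action):
        preds = torch.stack([m(latent, action) for m in self.members])  # (K, B, D)
        # Disagreement: variance over ensemble members, averaged over features.
        return preds.var(dim=0).mean(dim=-1)  # (B,)


# Usage example with hypothetical sizes (batch of 8 imagined latent states).
disagreement = LatentDisagreement(latent_dim=120, action_dim=6)
reward = disagreement.intrinsic_reward(torch.randn(8, 120), torch.randn(8, 6))
print(reward.shape)  # torch.Size([8])
```

In the paper, this disagreement reward is maximized by the exploration policy inside the learned world model; the sketch only shows how the ensemble-variance signal itself can be computed.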