Procedural generalization by planning with self-supervised world models
Authors: Ankesh Anand, Jacob C Walker, Yazhe Li, Eszter Vértes, Julian Schrittwieser, Sherjil Ozair, Theophane Weber, Jessica B Hamrick
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we explicitly measure the generalization ability of model-based agents in comparison to their model-free counterparts. We focus our analysis on MuZero [60], a powerful model-based agent, and evaluate its performance on both procedural and task generalization. We identify three factors of procedural generalization (planning, self-supervised representation learning, and procedural data diversity) and show that by combining these techniques, we achieve state-of-the-art generalization performance and data efficiency on Procgen [9]. Our results broadly indicate that self-supervised, model-based agents hold promise in making progress towards better generalization. We find that (1) MuZero achieves state-of-the-art performance on Procgen and the procedural and multi-task Meta-World benchmarks (ML-1 and ML-45 train), outperforming a controlled model-free baseline; |
| Researcher Affiliation | Collaboration | DeepMind, London, UK; work done while visiting from Mila, University of Montreal. |
| Pseudocode | No | The paper describes algorithmic details in prose (e.g., Section 2.3 and Appendix A), but does not contain any structured pseudocode or algorithm blocks with explicit labels like 'Algorithm' or 'Pseudocode'. |
| Open Source Code | No | The paper mentions that the environments (Procgen and Meta-World) are publicly available and refers to published work for MuZero, but does not explicitly state that the authors' implementation or modifications are open-source or provide a link to their code. |
| Open Datasets | Yes | We focus on two benchmarks designed for both types of generalization, Procgen [9] and Meta-World [74]. |
| Dataset Splits | No | The paper mentions varying the number of training levels and evaluating on an 'infinite test split' for Procgen, and training for a certain number of frames on Meta-World, but does not specify a separate validation dataset split or strategy. |
| Hardware Specification | Yes | For each game, we trained our model using 2 TPUv3-8 machines. A separate actor gathered environment trajectories with 1 TPUv3-8 machine. |
| Software Dependencies | No | The paper refers to specific versions of the MuZero agent (e.g., 'MuZero Reanalyse [61]', 'Sampled MuZero [32]') and mentions a 'ResNet' architecture, but does not provide specific version numbers for software libraries, programming languages, or other dependencies like TensorFlow, PyTorch, Python, etc. |
| Experiment Setup | Yes | We list major hyper-parameters used in this work in Table A.1. Values are given as Procgen / Meta-World. Model Unroll Length: 5 / 5; TD-Steps: 5 / 5; Reanalyse Fraction: 0.945 / 0.95; Replay Size (in sequences): 50000 / 2000; Number of Simulations: 50 / 50; UCB-constant: 1.25 / 1.25; Number of Samples: n/a / 20. Self-supervision: Reconstruction Loss Weight: 1.0 / 1.0; Contrastive Loss Weight: 1.0 / 0.1; SPR Loss Weight: 10.0 / 1.0. Optimization: Optimizer: Adam / Adam; Initial Learning Rate: 10⁻⁴ / 10⁻⁴; Batch Size: 1024 / 1024. |
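
For convenience, the quoted Table A.1 settings can be collected into plain Python config dictionaries. This is a minimal, hypothetical sketch: the key names (e.g. `reanalyse_fraction`, `spr_loss_weight`) are illustrative and do not come from the authors' implementation, which is not released; only the values restate the quotes above.

```python
# Hypothetical config sketch restating the Table A.1 values quoted in the
# "Experiment Setup" row; key names are illustrative, not the authors' API.

PROCGEN_CONFIG = {
    # MuZero core settings
    "model_unroll_length": 5,
    "td_steps": 5,
    "reanalyse_fraction": 0.945,
    "replay_size_sequences": 50_000,
    "num_simulations": 50,
    "ucb_constant": 1.25,
    # Self-supervision loss weights
    "reconstruction_loss_weight": 1.0,
    "contrastive_loss_weight": 1.0,
    "spr_loss_weight": 10.0,
    # Optimization
    "optimizer": "adam",
    "initial_learning_rate": 1e-4,
    "batch_size": 1024,
}

METAWORLD_CONFIG = {
    **PROCGEN_CONFIG,
    # Values that differ from the Procgen column
    "reanalyse_fraction": 0.95,
    "replay_size_sequences": 2_000,
    "num_samples": 20,  # Sampled MuZero action samples (n/a for Procgen)
    "contrastive_loss_weight": 0.1,
    "spr_loss_weight": 1.0,
}
```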