Procedural generalization by planning with self-supervised world models

Authors: Ankesh Anand, Jacob C Walker, Yazhe Li, Eszter Vértes, Julian Schrittwieser, Sherjil Ozair, Theophane Weber, Jessica B Hamrick

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we explicitly measure the generalization ability of model-based agents in comparison to their model-free counterparts. We focus our analysis on MuZero [60], a powerful model-based agent, and evaluate its performance on both procedural and task generalization. We identify three factors of procedural generalization (planning, self-supervised representation learning, and procedural data diversity) and show that by combining these techniques, we achieve state-of-the-art generalization performance and data efficiency on Procgen [9]. Our results broadly indicate that self-supervised, model-based agents hold promise in making progress towards better generalization. We find that (1) MuZero achieves state-of-the-art performance on Procgen and the procedural and multi-task Meta-World benchmarks (ML-1 and ML-45 train), outperforming a controlled model-free baseline;
Researcher Affiliation | Collaboration | DeepMind, London, UK, and "Work done while visiting from Mila, University of Montreal."
Pseudocode | No | The paper describes algorithmic details in prose (e.g., Section 2.3 and Appendix A), but does not contain any structured pseudocode or algorithm blocks with explicit labels such as 'Algorithm' or 'Pseudocode'. (A hedged sketch of the kind of training update the paper describes in prose appears after this table.)
Open Source Code | No | The paper mentions that the environments (Procgen and Meta-World) are publicly available and refers to published work for MuZero, but does not explicitly state that the authors' implementation or modifications are open source, nor does it provide a link to their code.
Open Datasets | Yes | We focus on two benchmarks designed for both types of generalization, Procgen [9] and Meta-World [74].
Dataset Splits | No | The paper mentions varying the number of training levels and evaluating on an 'infinite test split' for Procgen, and training for a certain number of frames on Meta-World, but does not specify a separate validation split or validation strategy. (A sketch of the standard Procgen train/test level split follows this table.)
Hardware Specification | Yes | For each game, we trained our model using 2 TPUv3-8 machines. A separate actor gathered environment trajectories with 1 TPUv3-8 machine.
Software Dependencies | No | The paper refers to specific variants of the MuZero agent (e.g., 'MuZero Reanalyse [61]', 'Sampled MuZero [32]') and mentions a ResNet architecture, but does not provide version numbers for software libraries, programming languages, or other dependencies such as TensorFlow, PyTorch, or Python.
Experiment Setup | Yes | We list major hyper-parameters used in this work in Table A.1.

| Hyper-parameter | Value (Procgen) | Value (Meta-World) |
| --- | --- | --- |
| Model Unroll Length | 5 | 5 |
| TD-Steps | 5 | 5 |
| Reanalyse Fraction | 0.945 | 0.95 |
| Replay Size (in sequences) | 50000 | 2000 |
| Number of Simulations | 50 | 50 |
| UCB-constant | 1.25 | 1.25 |
| Number of Samples | n/a | 20 |
| **Self-supervision** | | |
| Reconstruction Loss Weight | 1.0 | 1.0 |
| Contrastive Loss Weight | 1.0 | 0.1 |
| SPR Loss Weight | 10.0 | 1.0 |
| **Optimization** | | |
| Optimizer | Adam | Adam |
| Initial Learning Rate | 10^-4 | 10^-4 |
| Batch Size | 1024 | 1024 |
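The hyper-parameters quoted in the 'Experiment Setup' row can be collected into a plain configuration mapping for reference. This is only a convenience sketch of Table A.1 as quoted above; the key names are ours, not the paper's.

```python
# Table A.1 hyper-parameters as a plain dict, keyed by benchmark.
# Key names are ours; values are as quoted in the row above.
HYPERPARAMS = {
    "procgen": {
        "model_unroll_length": 5,
        "td_steps": 5,
        "reanalyse_fraction": 0.945,
        "replay_size_sequences": 50_000,
        "num_simulations": 50,
        "ucb_constant": 1.25,
        "num_samples": None,            # 'n/a' in Table A.1
        "reconstruction_loss_weight": 1.0,
        "contrastive_loss_weight": 1.0,
        "spr_loss_weight": 10.0,
        "optimizer": "adam",
        "initial_learning_rate": 1e-4,
        "batch_size": 1024,
    },
    "meta_world": {
        "model_unroll_length": 5,
        "td_steps": 5,
        "reanalyse_fraction": 0.95,
        "replay_size_sequences": 2_000,
        "num_simulations": 50,
        "ucb_constant": 1.25,
        "num_samples": 20,              # used by Sampled MuZero [32]
        "reconstruction_loss_weight": 1.0,
        "contrastive_loss_weight": 0.1,
        "spr_loss_weight": 1.0,
        "optimizer": "adam",
        "initial_learning_rate": 1e-4,
        "batch_size": 1024,
    },
}
```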
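The 'Pseudocode' row notes that the algorithm is described only in prose. As orientation, here is a minimal, hedged sketch of the kind of update the paper builds on: a MuZero-style unrolled model loss plus one weighted self-supervised auxiliary term. Every name in the sketch (`repr_net`, `dynamics_net`, `prediction_net`, `aux_loss`, and the batch keys) is a placeholder, not the authors' API, and the loss bookkeeping is simplified.

```python
# Hedged sketch of a MuZero-style update with an auxiliary self-supervised loss.
# Names and signatures below are placeholders, not the paper's implementation.

def muzero_style_loss(batch, repr_net, dynamics_net, prediction_net,
                      reward_loss, value_loss, policy_loss,
                      aux_loss, aux_weight, unroll_length=5):
    """Unroll the learned model `unroll_length` steps (Table A.1 uses 5) and
    accumulate the standard MuZero targets plus one self-supervised term."""
    state = repr_net(batch["observations"][0])        # initial latent state
    total = 0.0
    for k in range(unroll_length):
        policy_logits, value = prediction_net(state)
        total += policy_loss(policy_logits, batch["policy_targets"][k])
        total += value_loss(value, batch["value_targets"][k])
        state, reward = dynamics_net(state, batch["actions"][k])
        total += reward_loss(reward, batch["reward_targets"][k])
    # Self-supervised term (reconstruction, contrastive, or SPR in the paper),
    # weighted as in Table A.1.
    total += aux_weight * aux_loss(state, batch)
    return total
```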
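The 'Dataset Splits' row refers to Procgen's standard protocol of training on a restricted set of levels and evaluating on the unrestricted level distribution. Below is a minimal sketch of that setup using the public `procgen` Gym registration; the game name, level count, and `distribution_mode` are illustrative rather than the paper's exact configuration.

```python
import gym  # the `procgen` package registers its environments with Gym

def make_procgen(game: str, num_levels: int, start_level: int = 0,
                 distribution_mode: str = "easy"):
    # num_levels > 0 restricts the environment to a fixed set of procedurally
    # generated levels; num_levels = 0 draws from the full (effectively
    # "infinite") level distribution used for testing.
    return gym.make(
        f"procgen:procgen-{game}-v0",
        num_levels=num_levels,
        start_level=start_level,
        distribution_mode=distribution_mode,
    )

train_env = make_procgen("bigfish", num_levels=200)  # restricted train levels
test_env = make_procgen("bigfish", num_levels=0)     # unrestricted test levels
```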