Bootstrapped Meta-Learning

Authors: Sebastian Flennerhag, Yannick Schroecker, Tom Zahavy, Hado van Hasselt, David Silver, Satinder Singh

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we find that BMG provides substantial performance improvements over standard meta-gradients in various settings. We obtain a new state-of-the-art result for model-free agents on Atari (Section 5.2) and improve upon MAML (Finn et al., 2017) in the few-shot setting (Section 6).
Researcher Affiliation | Industry | Sebastian Flennerhag, DeepMind, flennerhag@google.com; Yannick Schroecker, DeepMind; Tom Zahavy, DeepMind; Hado van Hasselt, DeepMind; David Silver, DeepMind; Satinder Singh, DeepMind
Pseudocode | Yes | Algorithm 1: N-step RL actor loop (see the illustrative actor-loop sketch after the table)
Open Source Code | No | No explicit statement about the authors' own source code being released or available through a link.
Open Datasets | Yes | Mini-ImageNet (Vinyals et al., 2016; Ravi & Larochelle, 2017) is a sub-sample of the ImageNet dataset (Deng et al., 2009). Specifically, it is a subset of 100 classes sampled randomly from the 1000 classes in the ILSVRC-12 training set, with 600 images for each class. We follow the standard protocol (Ravi & Larochelle, 2017) and split classes into non-overlapping meta-training, meta-validation, and meta-test sets with 64, 16, and 20 classes in each, respectively. The dataset is licensed under the MIT licence and the ILSVRC licence. The dataset can be obtained from https://paperswithcode.com/dataset/miniimagenet-1.
Dataset Splits | Yes | We follow the standard protocol (Ravi & Larochelle, 2017) and split classes into non-overlapping meta-training, meta-validation, and meta-test sets with 64, 16, and 20 classes in each, respectively. (See the class-split sketch after the table.)
Hardware Specification | Yes | IMPALA's distributed setup is implemented on a single machine with 56 CPU cores and 8 TPU (Jouppi et al., 2017) cores. 2 TPU cores are used to act in 48 environments asynchronously in parallel, sending rollouts to a replay buffer that a centralized learner uses to update agent parameters and meta-parameters. Gradient computations are distributed along the batch dimension across the remaining 6 TPU cores. All Atari experiments use this setup; training for 200 million frames takes 24 hours. Each model is trained on a single machine and runs on a V100 NVIDIA GPU.
Software Dependencies | No | No specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow, etc.) are provided.
Experiment Setup | Yes | Table 1 (two-colors hyper-parameters) and Table 2 (Atari hyper-parameters) contain detailed lists of hyperparameters for optimizers, network architectures, and RL-specific parameters. For instance, for the actor-critic inner learner: optimiser SGD, learning rate 0.1, batch size 16 (losses are averaged), γ = 0.99. (See the config sketch after the table.)
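
The Pseudocode row refers to the paper's Algorithm 1, an N-step RL actor loop. Below is a minimal, generic sketch of such a loop, not a reproduction of the paper's algorithm: the `env`, `policy`, and `rollout_queue` interfaces and the `Rollout` container are hypothetical placeholders.

```python
from collections import namedtuple

# Hypothetical rollout container; field names are illustrative, not from the paper.
Rollout = namedtuple("Rollout", ["observations", "actions", "rewards", "discounts"])

def actor_loop(env, policy, rollout_queue, n_steps=16, total_steps=1_000_000):
    """Generic n-step actor loop: act for `n_steps` transitions, package them
    as a rollout, and push it to `rollout_queue` for a separate learner to consume."""
    obs = env.reset()
    step = 0
    while step < total_steps:
        observations, actions, rewards, discounts = [], [], [], []
        for _ in range(n_steps):
            action = policy.act(obs)                 # sample an action from the current policy
            next_obs, reward, done, _ = env.step(action)
            observations.append(obs)
            actions.append(action)
            rewards.append(reward)
            discounts.append(0.0 if done else 1.0)   # zero discount cuts the bootstrap at episode end
            obs = env.reset() if done else next_obs
            step += 1
        rollout_queue.put(Rollout(observations, actions, rewards, discounts))
```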
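The Open Datasets and Dataset Splits rows quote the standard Mini-ImageNet protocol of 64/16/20 non-overlapping classes for meta-training, meta-validation, and meta-testing. The sketch below illustrates such a disjoint class split; the `split_classes` helper, its seed, and the placeholder class names are assumptions for illustration (the standard protocol actually uses a fixed, published class list rather than a fresh random shuffle).

```python
import random

def split_classes(class_names, n_train=64, n_val=16, n_test=20, seed=0):
    """Partition a list of class names into disjoint meta-train/val/test sets."""
    assert len(class_names) == n_train + n_val + n_test
    rng = random.Random(seed)   # illustrative seed; Ravi & Larochelle (2017) fix the split explicitly
    shuffled = class_names[:]
    rng.shuffle(shuffled)
    meta_train = shuffled[:n_train]
    meta_val = shuffled[n_train:n_train + n_val]
    meta_test = shuffled[n_train + n_val:]
    return meta_train, meta_val, meta_test

# Example with 100 placeholder class ids, mirroring Mini-ImageNet's 100 classes.
train, val, test = split_classes([f"class_{i:03d}" for i in range(100)])
assert not set(train) & set(val) and not set(val) & set(test) and not set(train) & set(test)
```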
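The Experiment Setup row points to Tables 1 and 2 of the paper for the full hyper-parameter listings. The snippet below merely transcribes the handful of values quoted above into a config dictionary; the key names are assumptions and do not reflect the authors' code.

```python
# Illustrative two-colors inner-learner configuration, transcribing only the
# values quoted in the Experiment Setup row; key names are assumptions.
TWO_COLORS_CONFIG = {
    "optimizer": "sgd",
    "learning_rate": 0.1,
    "batch_size": 16,        # losses are averaged over the batch
    "discount_gamma": 0.99,  # actor-critic inner learner
}
```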