Continuous-Time Meta-Learning with Forward Mode Differentiation

Authors: Tristan Deleu, David Kanaa, Leo Feng, Giancarlo Kerg, Yoshua Bengio, Guillaume Lajoie, Pierre-Luc Bacon

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
Researcher Affiliation | Academia | Tristan Deleu, David Kanaa, Leo Feng, Giancarlo Kerg, Yoshua Bengio (1,2), Guillaume Lajoie (2), Pierre-Luc Bacon (2); Mila, Université de Montréal; 1: CIFAR Senior Fellow, 2: CIFAR AI Chair
Pseudocode | Yes | We give in Algorithm 1 the pseudo-code for meta-training COMLN, based on a distribution of tasks p(τ), with references to the relevant propositions developed in Appendices B and C.
Open Source Code | Yes | Code is available at: https://github.com/tristandeleu/jax-comln
Open Datasets | Yes | We evaluate COMLN on two standard few-shot image classification benchmarks: the miniImageNet (Vinyals et al., 2016) and tieredImageNet (Ren et al., 2018) datasets, both derived from ILSVRC-2012 (Russakovsky et al., 2015).
Dataset Splits | Yes | miniImageNet consists of 100 classes, split into 64 training classes, 16 validation classes, and 20 test classes.
Hardware Specification | Yes | The extrapolated dashed lines correspond to the method reaching the memory capacity of a Tesla V100 GPU with 32GB of memory.
Software Dependencies | No | The paper mentions JAX (Bradbury et al., 2018) and Haiku (Hennigan et al., 2020), but it does not specify version numbers for these dependencies. It also refers to a 4th-order Runge-Kutta method, a type of numerical solver, without naming a specific software library or version for its implementation.
Experiment Setup | Yes | To compute the adapted parameters and the meta-gradients in COMLN, we integrate the dynamical system described in Section 4.2 with a 4th-order Runge-Kutta method with a Dormand-Prince adaptive step size... Furthermore, to ensure that T > 0, we parametrized it with an exponential activation... For all methods and all datasets, we used SGD with momentum 0.9 and Nesterov acceleration, with a learning rate starting at 0.1 and decreasing according to the schedule provided by Lee et al. (2019).
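
As a rough illustration of the quoted Experiment Setup row (not the authors' implementation, which obtains meta-gradients via forward-mode differentiation along the flow), the sketch below integrates a gradient-flow adaptation ODE with JAX's adaptive Dormand-Prince solver, keeps the learned horizon T positive through an exponential parametrization, and builds an SGD meta-optimizer with Nesterov momentum 0.9. The loss, helper names, and tensor shapes are assumptions made for the example.

```python
# Minimal sketch, assuming a linear classification head adapted on frozen
# support-set embeddings; this is NOT the COMLN meta-gradient computation.
import jax
import jax.numpy as jnp
from jax.experimental.ode import odeint  # adaptive Dormand-Prince (Dopri5) solver
import optax

def adaptation_loss(w, features, labels):
    """Cross-entropy of a linear head on support embeddings (assumed setup)."""
    logits = features @ w  # features: (N, d), w: (d, num_classes)
    return jnp.mean(optax.softmax_cross_entropy_with_integer_labels(logits, labels))

def gradient_flow(w, t, features, labels):
    """Right-hand side of the adaptation ODE: dw/dt = -grad L(w)."""
    return -jax.grad(adaptation_loss)(w, features, labels)

def adapt(w0, log_T, features, labels):
    """Integrate the gradient flow from t=0 to t=T, with T = exp(log_T) > 0."""
    T = jnp.exp(log_T)  # exponential activation guarantees a positive horizon
    ts = jnp.array([0.0, 1.0]) * T
    w_path = odeint(gradient_flow, w0, ts, features, labels, rtol=1e-5, atol=1e-5)
    return w_path[-1]  # adapted head parameters at time T

# Meta-optimizer matching the quoted settings: SGD, momentum 0.9, Nesterov
# acceleration, initial learning rate 0.1 (the step-wise decay is omitted here).
meta_opt = optax.sgd(learning_rate=0.1, momentum=0.9, nesterov=True)
```

In an outer meta-training loop one would sample tasks from p(τ), call `adapt` on each task's support set, evaluate the query loss at the adapted parameters, and apply `meta_opt` updates; the paper's Algorithm 1 instead propagates the meta-gradients forward in time alongside the ODE rather than differentiating through the solver as this sketch would.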