Two Sides of Meta-Learning Evaluation: In vs. Out of Distribution

Authors: Amrith Setlur, Oscar Li, Virginia Smith

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We categorize meta-learning evaluation into two settings: in-distribution (ID), in which the train and test tasks are sampled iid from the same underlying task distribution, and out-of-distribution (OOD), in which they are not. While most meta-learning theory and some FSL applications follow the ID setting, we identify that most existing few-shot classification benchmarks instead reflect OOD evaluation, as they use disjoint sets of train (base) and test (novel) classes for task generation. This discrepancy is problematic because, as we show on numerous benchmarks, meta-learning methods that perform better on existing OOD datasets may perform significantly worse in the ID setting. (A code sketch contrasting ID and OOD task sampling appears after the table.)
Researcher Affiliation | Academia | Amrith Setlur (Language Technologies Institute), Oscar Li (Machine Learning Department), Virginia Smith (Machine Learning Department), School of Computer Science, Carnegie Mellon University; asetlur@cs.cmu.edu, oscarli@cmu.edu, smithv@cmu.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at https://github.com/ars22/meta-learning-eval-id-vs-ood.
Open Datasets | Yes | A plethora of few-shot image classification benchmarks (e.g., miniImageNet (mini in short) [42], CIFAR-FS [4]) have been developed for FSL evaluation. ... In this setting, a popular benchmark is the FEMNIST [6] handwriting recognition dataset. ... Ren et al. [35] propose the use of the Zappos [43] dataset as a meta-learning benchmark...
Dataset Splits | Yes | These benchmarks typically provide three disjoint sets of classes: base classes C_B, validation classes C_V, and novel classes C_N. ... We are given a total of 3500 writers sampled iid from P(id) and we randomly partition them into a 2509/538/538 split for training, validation, and test tasks, following similar practices used in prior FL work [20, 7]. ... we randomly partition each base class's current examples into an approximate 80/20 split, where the training tasks are constructed using the former and the latter is reserved for ID evaluation. (A sketch of this per-class 80/20 split appears after the table.)
Hardware Specification | Yes | Experiments are conducted on a single RTX 2080 GPU unless explicitly stated otherwise.
Software Dependencies | No | The paper mentions code availability and instructions but does not list specific software dependencies with version numbers within the text.
Experiment Setup | Yes | All methods (except FOMAML on FEMNIST) are trained for 60 epochs with a learning rate of 1e-3 and a learning rate scheduler that decays by 0.1 at 40 and 50 epochs.
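To make the ID/OOD distinction in the Research Type row concrete, the following is a minimal sketch of episodic task sampling under both settings. The function and variable names (sample_task, base_heldout, the toy string examples) are illustrative and not taken from the authors' code; only the idea that OOD tasks use disjoint novel classes while ID tasks reuse base classes with held-out examples comes from the paper.

```python
import random

def sample_task(class_pool, examples_by_class, n_way=2, k_shot=1, n_query=2):
    """Sample one n-way, k-shot episode from the given pool of classes."""
    classes = random.sample(list(class_pool), n_way)
    support, query = [], []
    for label, c in enumerate(classes):
        shots = random.sample(examples_by_class[c], k_shot + n_query)
        support += [(x, label) for x in shots[:k_shot]]
        query += [(x, label) for x in shots[k_shot:]]
    return support, query

# Toy data: base classes 0-3 are used for meta-training; novel classes 4-7 are not.
base_classes, novel_classes = [0, 1, 2, 3], [4, 5, 6, 7]
examples = {c: [f"img_{c}_{i}" for i in range(20)] for c in range(8)}
# Hold out ~20% of each base class's examples for ID evaluation.
base_heldout = {c: examples[c][16:] for c in base_classes}

# OOD evaluation (the usual FSL benchmark protocol): tasks from disjoint novel classes.
ood_task = sample_task(novel_classes, examples)
# ID evaluation: tasks from the same base classes, using their held-out examples.
id_task = sample_task(base_classes, base_heldout)
```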
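The per-class 80/20 split described in the Dataset Splits row can be written down directly. This is a minimal sketch assuming an examples_by_class dict mapping each base class to its list of examples; the function name and the fixed seed are illustrative, not taken from the released code.

```python
import random

def split_for_id_eval(examples_by_class, id_frac=0.2, seed=0):
    """Partition each base class's examples ~80/20: a pool used to build
    training tasks and a held-out pool reserved for ID evaluation."""
    rng = random.Random(seed)
    train_pool, id_eval_pool = {}, {}
    for c, examples in examples_by_class.items():
        shuffled = list(examples)
        rng.shuffle(shuffled)
        n_heldout = max(1, round(id_frac * len(shuffled)))
        id_eval_pool[c] = shuffled[:n_heldout]
        train_pool[c] = shuffled[n_heldout:]
    return train_pool, id_eval_pool
```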
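The training recipe in the Experiment Setup row (60 epochs, initial learning rate 1e-3, decay by a factor of 0.1 at epochs 40 and 50) maps onto a standard step scheduler. A minimal PyTorch sketch follows; the choice of Adam and the placeholder linear model are assumptions, and the actual models, optimizers, and training loops live in the authors' repository.

```python
import torch

model = torch.nn.Linear(640, 5)  # placeholder; stands in for the meta-learner's backbone
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # optimizer choice is an assumption
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[40, 50], gamma=0.1)

for epoch in range(60):
    # ... one epoch of meta-training over sampled tasks goes here ...
    scheduler.step()  # lr: 1e-3 until epoch 40, then 1e-4, then 1e-5 from epoch 50
```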