On the Role of Neural Collapse in Transfer Learning
Authors: Tomer Galanti, András György, Marcus Hutter
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we experimentally analyze the neural collapse phenomenon and how it generalizes to new data points and new classes. We use reasonably good classifiers to demonstrate that, in addition to the neural collapse observed at training time by Papyan et al. (2020), it is also observable on test data from the same classes, as well as on data from new classes, as predicted by our theoretical results. We also show that, as expected intuitively, neural collapse is strongly correlated with accuracy in few-shot learning scenarios. The experiments are conducted over multiple datasets and multiple architectures, providing strong empirical evidence that neural collapse provides a compelling explanation for the good performance of foundation models in few-shot learning tasks. Experimental results are reported averaged over 20 random initializations together with 95% confidence intervals. (A sketch of the collapse measure used in these experiments follows the table.) |
| Researcher Affiliation | Collaboration | Tomer Galanti, MIT, galanti@mit.edu; András György, DeepMind, agyorgy@deepmind.com; Marcus Hutter, DeepMind, mhutter@deepmind.com |
| Pseudocode | No | The paper describes the methods and processes through mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the methodology described, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Datasets. We consider four different datasets: (i) Mini-ImageNet (Vinyals et al., 2016); (ii) CIFAR-FS (Bertinetto et al., 2019); (iii) FC100 (Oreshkin et al., 2018); and (iv) EMNIST (balanced) (Cohen et al., 2017). |
| Dataset Splits | Yes | Each dataset is split into meta-train, meta-validation and meta-test classes; we select the data for the source classes from the meta-training classes, and similarly use the meta-test data for the target tasks (we do not use the meta-validation classes). Each one of the class splits is also partitioned into train and test samples; we use these for training and evaluating our models. The Mini-ImageNet dataset contains 100 classes randomly chosen from ImageNet ILSVRC-2012 (Russakovsky et al., 2015) with 600 images of size 84×84 pixels per class. It is split into 64 meta-training classes, 16 meta-validation classes and 20 classes for meta-testing. |
| Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., specific GPU or CPU models) used for running its experiments. |
| Software Dependencies | No | The paper mentions software components like "SGD" and "ResNets" but does not provide specific version numbers for any libraries, frameworks, or programming languages. |
| Experiment Setup | Yes | The training is conducted using SGD with learning rate η and momentum 0.9 with batch size 64. Here, g is the top linear layer of the neural network and f is the mapping implemented by all other layers. At the second stage, given a target few-shot classification task with training data S = {(x_i, y_i)}_{i=1}^n, we train a new top layer g as a solution of ridge regression acting on the dataset {(f(x_i), y_i)}_{i=1}^n with regularization λ_n = αn. Thus, g is a linear transformation with the weight matrix w_{S,f} = (f(X)^T f(X) + λ_n I)^{-1} f(X)^T Y, where f(X) is the n×d data matrix for the ridge regression problem containing the feature vectors {f(x_i)} (i.e., f(X) = [f(x_1), ..., f(x_n)]) and Y is the n×k label matrix (i.e., Y = [y_1, ..., y_n]), where X ∈ R^{n×d}, Y ∈ R^{n×k} and f_θ(X) ∈ R^{n×p}. We did not apply any form of fine-tuning for f at the second stage. In the experiments we sample 5-class classification tasks randomly from the target dataset, with n_c training samples for each class (thus, altogether n = 5·n_c above), and measure the performance on 100 random test samples from each class. We report the resulting accuracy rates averaged over 100 randomly chosen tasks. Architectures and hyperparameters. We experimented with two types of architectures for h: wide ResNets (Zagoruyko & Komodakis, 2016) and vanilla convolutional networks of the same structure without the residual connections, denoted by WRN-N-M and Conv-N-M, where N is the depth and M is the width factor. We used the following default hyperparameters: η = 2^{-4}, batch size 64 and α = 1. (A sketch of this ridge-regression head follows the table.) |
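Below is a minimal sketch of the second-stage evaluation quoted in the Experiment Setup row: a ridge-regression head fitted in closed form on frozen features. The synthetic features and the names `feat_dim`, `n_shots`, and `alpha` are illustrative assumptions; in the paper, the features come from a network f trained on the source classes.

```python
# Sketch of the paper's second-stage few-shot evaluation: fit a linear head
# g by ridge regression on frozen features f(x), with lambda_n = alpha * n.
# The random features below are placeholders for embeddings from a trained f.
import numpy as np

rng = np.random.default_rng(0)

n_classes, n_shots, feat_dim = 5, 5, 64   # a 5-way, n_c-shot task
n = n_classes * n_shots
alpha = 1.0
lam = alpha * n                            # lambda_n = alpha * n

# Placeholder feature matrix F = f(X) (n x p) and one-hot labels Y (n x k).
F = rng.normal(size=(n, feat_dim))
labels = np.repeat(np.arange(n_classes), n_shots)
Y = np.eye(n_classes)[labels]

# Closed-form ridge solution: W = (F^T F + lam I)^{-1} F^T Y.
W = np.linalg.solve(F.T @ F + lam * np.eye(feat_dim), F.T @ Y)

# Classify by the argmax of the linear head g(x) = W^T f(x); no fine-tuning of f.
train_acc = (np.argmax(F @ W, axis=1) == labels).mean()
print(f"train accuracy on the sampled task: {train_acc:.2f}")
```

With real features in place of the random placeholders, the same closed-form solve is repeated over 100 sampled tasks and the test accuracy is averaged, as the quoted setup describes.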
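The Research Type row reports that neural collapse is measured on training data, test data, and new classes, and that it correlates with few-shot accuracy. The paper quantifies collapse with the class-distance normalized variance (CDNV); the sketch below shows one way to compute it for a pair of classes, with random arrays standing in for actual embeddings f(x) (the helper name `cdnv` is ours).

```python
# Sketch of the class-distance normalized variance (CDNV): the within-class
# feature variance of two classes divided by twice the squared distance
# between their class means. Smaller values indicate stronger collapse.
import numpy as np

def cdnv(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """CDNV between two classes given their (n_i x p) feature matrices:
    (Var(a) + Var(b)) / (2 * ||mean(a) - mean(b)||^2)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    var_a = np.mean(np.sum((feats_a - mu_a) ** 2, axis=1))
    var_b = np.mean(np.sum((feats_b - mu_b) ** 2, axis=1))
    return (var_a + var_b) / (2.0 * np.sum((mu_a - mu_b) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, size=(200, 64))   # placeholder embeddings, class 1
b = rng.normal(loc=1.0, size=(200, 64))   # placeholder embeddings, class 2
print(f"CDNV: {cdnv(a, b):.3f}")
```

In the paper's experiments this quantity is averaged over class pairs, computed separately on source-class train data, source-class test data, and unseen target classes.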