Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Understanding and Improving Transfer Learning of Deep Models via Neural Collapse
Authors: Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, Qing Qu
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the relationship between neural collapse (NC) and transfer learning for classification problems. NC is an intriguing yet prevalent phenomenon recently discovered in the final-layer features and linear classifiers of trained neural networks. Specifically, during the terminal phase of training, NC implies that the variability of the features within each class diminishes to zero, while the means of features between classes are maximally and equally distanced. In this work, we examine the NC attributes of pre-trained models on both downstream and training data for transfer learning, and we find a strong correlation between feature collapse and downstream performance. In particular, we discovered a systematic pattern that emerges when linear probing pre-trained models on downstream training data: the more the features of a pre-trained model collapse on downstream data, the higher the transfer accuracy. We also studied the relationship between NC and transfer accuracy on the training data. Moreover, these findings allow us to develop a principled, parameter-efficient fine-tuning method that employs skip connections to induce last-layer feature collapse on downstream data. Our proposed fine-tuning methods deliver good performance while reducing fine-tuning parameters by at least 90% and mitigating overfitting, especially when downstream data is scarce. These findings are supported by comprehensive experiments conducted on multiple downstream datasets (Krizhevsky et al., 2009; Maji et al., 2013; Cimpoi et al., 2014; Parkhi et al., 2012), diverse pre-trained models (He et al., 2016; Dosovitskiy et al., 2021; Huang et al., 2017; Sandler et al., 2018; Radford et al., 2021), and even within the context of few-shot learning (Tian et al., 2020; Liu et al., 2023b). |
| Researcher Affiliation | Academia | Xiao Li (Department of Electrical Engineering and Computer Science, University of Michigan); Sheng Liu (Biomedical Data Sciences, Stanford University); Jinxin Zhou (Department of Computer Science and Engineering, Ohio State University); Xinyu Lu (School of Computer Science, Carnegie Mellon University); Carlos Fernandez-Granda (Center for Data Science, New York University); Zhihui Zhu (Department of Computer Science and Engineering, Ohio State University); Qing Qu (Department of Electrical Engineering and Computer Science, University of Michigan) |
| Pseudocode | No | The paper describes methods and strategies, but it does not include any explicit pseudocode or algorithm blocks. The methods are explained in descriptive text. |
| Open Source Code | No | The paper mentions that 'The checkpoint used for ViT-B can be found here.' in footnote 5, referring to a pre-trained model. However, it does not explicitly state that the authors' own implementation code for the methodology described in the paper is openly available, nor does it provide a link to a repository for their code. |
| Open Datasets | Yes | More collapsed features on downstream tasks lead to better transfer accuracy. Through an extensive examination of pre-trained models using linear probing across various scenarios, our work shows the following relationship: a higher degree of feature collapse on downstream data tends to yield improved transfer accuracy. This phenomenon is supported by comprehensive experiments conducted on multiple downstream datasets (Krizhevsky et al., 2009; Maji et al., 2013; Cimpoi et al., 2014; Parkhi et al., 2012), diverse pre-trained models (He et al., 2016; Dosovitskiy et al., 2021; Huang et al., 2017; Sandler et al., 2018; Radford et al., 2021), and even within the context of few-shot learning (Tian et al., 2020; Liu et al., 2023b). We transfer the pre-trained models to four different downstream datasets: Cifar-10 (Krizhevsky et al., 2009), FGVC-Aircraft (Maji et al., 2013), DTD (Cimpoi et al., 2014), and Oxford-IIIT Pet (Parkhi et al., 2012). We conducted experiments utilizing CLIP (Radford et al., 2021), a model trained on matching image and caption pairs without reliance on labeling information. Specifically, we employed the image encoder of CLIP as a frozen feature extractor to extract features from diverse datasets including Cifar (Krizhevsky et al., 2009), DTD (Cimpoi et al., 2014), FGVC-Aircraft (Maji et al., 2013), Oxford102-Flower (Nilsback & Zisserman, 2008), and SUN397 (Xiao et al., 2010). |
| Dataset Splits | Yes | More specifically, this negative correlation between downstream NC1 and transfer accuracy also applies to few-shot (FS) learning settings, as shown in Table 2 for the miniImageNet and CIFAR-FS datasets. Following (Tian et al., 2020), we pre-train different models on the merged meta-training data, then freeze the models and learn a linear classifier at meta-testing time. We conducted fine-tuning experiments on pre-trained ResNet18/CLIP models using varying sizes of subsets from the Cifar-10/Cifar-100 training samples. The outcomes are presented in Figure 8. We transfer the ImageNet-1k pre-trained ResNet18 model on subsets of Cifar-10 with varying sizes. More specifically, we select the sizes from a logarithmically spaced list of values: [3, 10, 32, 100, 316] for each class. We also conduct the same experiment using the transformer-based architecture, for which we fine-tune the pre-trained CLIP model using subsets of the Cifar-100 dataset with per-class sizes from: [7, 17, 38, 87, 200]. |
| Hardware Specification | Yes | General experiment setups. We perform all experiments using a single NVIDIA A40 GPU; most experiments can be finished in less than 4 hours. |
| Software Dependencies | No | The paper mentions 'SGD with a momentum of 0.9, a weight decay of 1×10⁻⁴, and a dynamically changing learning rate ranging from 1×10⁻¹ to 1×10⁻⁴ controlled by a Cosine Annealing learning rate scheduler as described in (Loshchilov & Hutter, 2017)' and 'We use the multi-class logistic regression implemented in scikit-learn (Pedregosa et al., 2011) for the base classifier.' While specific algorithms and tools are named, explicit version numbers for general software libraries such as Python, PyTorch, TensorFlow, or scikit-learn itself are not provided. |
| Experiment Setup | Yes | Unless otherwise specified, all pre-training and transfer learning are run for 200 epochs using SGD with a momentum of 0.9, a weight decay of 1×10⁻⁴, and a dynamically changing learning rate ranging from 1×10⁻¹ to 1×10⁻⁴ controlled by a Cosine Annealing learning rate scheduler as described in (Loshchilov & Hutter, 2017). When using ImageNet pre-trained models, we resize each input image to 224×224 for training, testing, and evaluating NC. In terms of hyperparameter selection, we consider 2 learning rates ([1e-1, 5e-2] for ResNet models and [1e-2, 1e-3] for ViT) and 2 weight decays ([1e-4, 5e-4] for ResNet models and [1e-4, 0.0] for ViT) to identify the optimal settings for each method. We report the final selected hyperparameters in Table 4. |
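The within-class variability collapse (NC1) quoted in the Research Type row can be made concrete with a short sketch. The formulation below, the trace of the within-class covariance times the pseudoinverse of the between-class covariance, averaged over classes, is the NC1 metric commonly used in the neural-collapse literature; the paper's exact normalization may differ, so treat this as an illustrative implementation rather than the authors' code.

```python
import numpy as np

def nc1_metric(features, labels):
    """Within-class variability collapse metric:
    NC1 = tr(Sigma_W @ pinv(Sigma_B)) / K.
    Lower values mean more collapsed (less variable)
    within-class features; fully collapsed features give 0."""
    classes = np.unique(labels)
    K, d = len(classes), features.shape[1]
    global_mean = features.mean(axis=0)
    sigma_w = np.zeros((d, d))  # within-class covariance
    sigma_b = np.zeros((d, d))  # between-class covariance
    for c in classes:
        fc = features[labels == c]
        mu_c = fc.mean(axis=0)
        centered = fc - mu_c
        sigma_w += centered.T @ centered / len(features)
        diff = (mu_c - global_mean)[:, None]
        sigma_b += diff @ diff.T / K
    return np.trace(sigma_w @ np.linalg.pinv(sigma_b)) / K
```

Tightly clustered class features (small within-class spread) yield a value near zero, matching the paper's finding that lower downstream NC1 correlates with higher transfer accuracy.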
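The linear probing protocol referenced throughout the table, fitting scikit-learn's multi-class logistic regression on frozen features, might look like the following minimal sketch. The regularization strength `C` and iteration cap are illustrative defaults, not values reported in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe(train_feats, train_labels, test_feats, test_labels, C=1.0):
    """Fit a multi-class logistic regression on frozen
    (pre-extracted) features and return test accuracy.
    C and max_iter here are placeholders, not the paper's settings."""
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```

In the paper's setup, `train_feats`/`test_feats` would be features extracted by a frozen pre-trained backbone (e.g. the CLIP image encoder) on a downstream dataset.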
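The optimization recipe in the Experiment Setup row (SGD, momentum 0.9, weight decay 1×10⁻⁴, cosine-annealed learning rate from 1×10⁻¹ down to 1×10⁻⁴ over 200 epochs) corresponds to roughly the following PyTorch configuration; the linear model is a stand-in, not the paper's architecture, and the empty training step is elided.

```python
import torch

model = torch.nn.Linear(512, 10)  # stand-in head, not the paper's model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-4)

for epoch in range(200):
    # ... one training epoch over the downstream data would go here ...
    optimizer.step()   # placeholder so the scheduler advances after a step
    scheduler.step()   # anneal the learning rate toward eta_min
```

After 200 epochs the scheduler has annealed the learning rate down to `eta_min`, matching the 1×10⁻¹ → 1×10⁻⁴ range quoted from the paper.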