Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

Authors: Hien Dang, Tho Tran Huu, Stanley Osher, Hung The Tran, Nhat Ho, Tan Minh Nguyen

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results demonstrate the convergence of the last-layer features and classifiers to a geometry consisting of orthogonal vectors whose lengths depend on the amount of data in their corresponding classes. Finally, we empirically validate our theoretical analyses on synthetic and practical network architectures in both balanced and imbalanced scenarios. From Section 5 (Experimental Results) of the paper: In this section, we empirically verify our theoretical results in multiple settings for both balanced and imbalanced data. In particular, we observe the evolution of NC properties during the training of deep linear networks with a prior backbone feature extractor (e.g., MLP, ResNet18) to create the unconstrained features (see Fig. 1 for a sample visualization). The experiments are performed on the CIFAR10 (Krizhevsky et al., 2009) and EMNIST letter (Cohen et al., 2017) datasets for the image classification task. A minimal sketch of this backbone-plus-deep-linear-head setup is given after the table.
Researcher Affiliation | Collaboration | Hien Dang*1, Tho Tran*1, Stanley Osher2, Hung Tran-The3, Nhat Ho**4, Tan Nguyen**5 (*, ** equal contribution). 1 FPT Software AI Center, Vietnam; 2 Department of Mathematics, University of California, Los Angeles, USA; 3 Applied Artificial Intelligence Institute, Deakin University, Victoria, Australia; 4 Department of Statistics and Data Sciences, University of Texas at Austin, USA; 5 Department of Mathematics, National University of Singapore, Singapore. Correspondence to: Hien Dang <danghoanghien1123@gmail.com>, Tho Tran <thotranhuu99@gmail.com>.
Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper; the methodology is described through mathematical derivations and prose.
Open Source Code | No | The paper contains no explicit statement about releasing source code and no link to a code repository for the described methodology.
Open Datasets | Yes | The experiments are performed on the CIFAR10 (Krizhevsky et al., 2009) and EMNIST letter (Cohen et al., 2017) datasets for the image classification task. To verify that the results are consistent across datasets, we also conduct experiments on text classification tasks in Appendix C.1.2, using four text classification datasets: AG News, IMDB, Sogou News, and Yelp Review Polarity.
Dataset Splits | No | The paper describes how certain training subsets were created, including specific counts per class for imbalanced data: "We choose a random subset of the CIFAR10 dataset with the number of training samples of each class chosen from the list {500, 500, 400, 400, 300, 300, 200, 200, 100, 100}." and "Our training set is randomly sampled from the EMNIST letter training set. The number of training samples is as follows: 1 major class with 1500 samples, 5 medium classes with 600 samples per class, and 20 minor classes with 50 samples per class." However, it does not explicitly state training/validation/test splits in percentages or absolute numbers that would allow the data partitioning to be fully reproduced for all experiments. A sketch of this imbalanced subsampling appears after the table.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory, or processing units) used for running the experiments are explicitly provided. The paper discusses models and datasets but omits hardware specifications.
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9, TensorFlow 2.x) are explicitly listed. The paper mentions using the "Adam optimizer (Kingma & Ba, 2014)" and "stochastic gradient descent (SGD)" but without versions for the underlying libraries.
Experiment Setup | Yes | All models are trained with the Adam optimizer and MSE loss for 200 epochs with batch size 128 and learning rate 1e-4 (divided by 10 every 50 epochs). Weight decay and feature decay are set to 1e-4. For ResNet18 backbone models, we use a learning rate of 0.05 and weight decay of 2e-4; each model is trained with the SGD optimizer, batch size 128, and MSE loss until convergence. We perform a hyperparameter search over learning rates {1e-4, 5e-4, 0.001, 0.005, 0.01}. A sketch of this training recipe appears after the table.
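To make the setup in the Research Type row concrete, here is a minimal sketch assuming a PyTorch implementation. The class and function names (DeepLinearHead, LinearNetWithBackbone, class_mean_gram), the depth, and the widths are illustrative assumptions, not the authors' code: it stacks a bias-free deep linear head on a ResNet18 backbone and computes the Gram matrix of class-mean features, whose off-diagonal entries should shrink toward zero if the features approach the orthogonal geometry described above, with diagonal magnitudes reflecting class sizes.

```python
# Hedged sketch only: layer sizes, depth, and names are illustrative,
# not taken from the paper's (unreleased) code.
import torch
import torch.nn as nn
import torchvision.models as models


class DeepLinearHead(nn.Module):
    """Stack of bias-free linear layers with no nonlinearities (a deep linear network)."""

    def __init__(self, in_dim: int, hidden_dim: int, num_classes: int, depth: int = 3):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * (depth - 1) + [num_classes]
        self.layers = nn.Sequential(
            *[nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(depth)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)


class LinearNetWithBackbone(nn.Module):
    """Backbone feature extractor followed by a deep linear classifier head."""

    def __init__(self, num_classes: int = 10, feat_dim: int = 512):
        super().__init__()
        backbone = models.resnet18(weights=None)  # backbone feature extractor
        backbone.fc = nn.Identity()               # expose the 512-d features
        self.backbone = backbone
        self.head = DeepLinearHead(512, feat_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(x)  # "unconstrained" last-layer features
        return self.head(feats)


def class_mean_gram(features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Gram matrix of class-mean features: near-zero off-diagonal entries
    indicate orthogonal class means; the diagonal tracks their lengths."""
    classes = labels.unique()
    means = torch.stack([features[labels == c].mean(dim=0) for c in classes])
    return means @ means.T
```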
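The CIFAR10 class counts quoted in the Dataset Splits row can be illustrated by per-class subsampling. This is a hypothetical sketch assuming torchvision's CIFAR10; the helper imbalanced_subset and its seed handling are written for illustration and are not taken from the paper.

```python
# Hedged sketch: build an imbalanced CIFAR10 training subset with the
# per-class counts quoted in the Dataset Splits row.
import numpy as np
from torch.utils.data import Subset
from torchvision import transforms
from torchvision.datasets import CIFAR10

PER_CLASS = [500, 500, 400, 400, 300, 300, 200, 200, 100, 100]


def imbalanced_subset(root: str = "./data", seed: int = 0) -> Subset:
    """Randomly keep a fixed number of training samples per CIFAR10 class."""
    rng = np.random.default_rng(seed)
    train = CIFAR10(root, train=True, download=True,
                    transform=transforms.ToTensor())
    targets = np.asarray(train.targets)
    keep = []
    for cls, n in enumerate(PER_CLASS):
        idx = np.flatnonzero(targets == cls)
        keep.extend(rng.choice(idx, size=n, replace=False))
    return Subset(train, sorted(keep))
```

Returning a Subset keeps the original dataset untouched, so the same object can be re-sampled with different seeds or class counts (e.g., for the EMNIST letter imbalance) without re-downloading anything.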
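The first training recipe in the Experiment Setup row (Adam, MSE loss, 200 epochs, batch size 128, learning rate 1e-4 divided by 10 every 50 epochs, weight decay 1e-4) maps onto a short PyTorch loop. The sketch below is an assumption-laden illustration rather than the authors' code: it uses one-hot targets for the MSE loss, a StepLR schedule for the decay rule, and omits the separate feature-decay penalty mentioned in the paper.

```python
# Hedged sketch of the quoted training recipe; the feature-decay term
# from the paper is omitted here for brevity.
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader


def train(model, train_set, num_classes=10, epochs=200, device="cuda"):
    model = model.to(device)
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
    # "divided by 10 every 50 epochs"
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            target = F.one_hot(y, num_classes).float()  # MSE loss on one-hot labels
            loss = F.mse_loss(model(x), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()
    return model
```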