Pushing Boundaries: Mixup's Influence on Neural Collapse

Authors: Quinn LeBlanc Fisher, Haoming Meng, Vardan Papyan

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response

Research Type | Experimental
Our investigation (code), spanning various architectures and dataset pairs, reveals that mixup's last-layer activations predominantly converge to a distinctive configuration different than one might expect. ... We conduct an extensive empirical study focusing on the last-layer activations of mixup training data. Our study reveals that mixup induces a geometric configuration of last-layer activations across various datasets and models. ... The results of our extensive empirical investigation are presented in Figures 1, 3, 5, 10, and 12. These figures collectively illustrate a consistent identification of a unique last-layer configuration induced by mixup, observed across a diverse range of:
- Architectures: Our study incorporated the WideResNet-40-10 (Zagoruyko & Komodakis, 2017) and ViT-B (Dosovitskiy et al., 2021) architectures;
- Datasets: The datasets employed included FashionMNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009);
- Optimizers: We used stochastic gradient descent (SGD), Adam (Kingma & Ba, 2017), and AdamW (Loshchilov & Hutter, 2017) as optimizers.
The networks trained showed good generalization performance and calibration, as substantiated by the data presented in Tables 1 and 2.
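
For readers unfamiliar with the augmentation being analyzed, here is a minimal PyTorch-style sketch of mixup (Zhang et al., 2018), the technique whose last-layer activations the paper studies. It is illustrative only: the function name mixup_batch and the default alpha are assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=1.0):
    """Return a mixup-augmented batch: convex combinations of the inputs and of
    their one-hot labels, with a weight drawn from Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))                         # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[perm]                # mix the inputs
    y_onehot = F.one_hot(y, num_classes).float()
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]  # mix the labels with the same weight
    return x_mixed, y_mixed
```

Training on such mixed pairs with a soft-label cross-entropy gives the standard mixup loss; the last-layer activations examined in the paper are those of the mixed training points.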

Researcher Affiliation | Academia
Quinn LeBlanc Fisher, Haoming Meng, Vardan Papyan (University of Toronto)

Pseudocode | No
The paper includes mathematical derivations and descriptions of algorithms (like the projection method in Appendix B.2), but it does not present any formal pseudocode or algorithm blocks.

Open Source Code | No
The abstract mentions "Our investigation (code)", but there is no explicit statement of code release, no link to a repository, and no mention of code in supplementary materials within the provided PDF.

Open Datasets | Yes
We consider FashionMNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009) datasets.
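
All three datasets are publicly available through torchvision; a minimal loading sketch is shown below. The root path and the bare ToTensor transform are placeholders, not the preprocessing used in the paper.

```python
from torchvision import datasets, transforms

# Placeholder preprocessing; the paper's exact transforms are not specified here.
to_tensor = transforms.ToTensor()

train_sets = {
    "FashionMNIST": datasets.FashionMNIST("./data", train=True, download=True, transform=to_tensor),
    "CIFAR10":      datasets.CIFAR10("./data", train=True, download=True, transform=to_tensor),
    "CIFAR100":     datasets.CIFAR100("./data", train=True, download=True, transform=to_tensor),
}
```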

Dataset Splits | No
The paper discusses training and testing, but does not explicitly specify the training/validation/test dataset splits (e.g., percentages or sample counts for a distinct validation set) needed to reproduce the data partitioning.

Hardware Specification | No
The paper acknowledges support from Compute Ontario and Compute Canada, implying computational resources were used, but does not provide specific details such as GPU models, CPU models, or memory specifications.

Software Dependencies | No
The paper mentions using specific optimizers like SGD, Adam, and AdamW, but does not provide version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow).

Experiment Setup | Yes
Hyperparameter details are outlined in Appendix B.1. ... For the WideResNet experiments, we minimize the mixup loss using stochastic gradient descent (SGD) with momentum 0.9 and weight decay 1 × 10⁻⁴. All datasets are trained on a WideResNet-40-10 for 500 epochs with a batch size of 128. We sweep over 10 logarithmically spaced learning rates between 0.01 and 0.25, picking whichever results in the highest test accuracy. The learning rate is annealed by a factor of 10 at 30%, 50%, and 90% of the total training time. For the ViT experiments, we minimize the mixup loss using Adam optimization (Kingma & Ba, 2017). For each dataset we train a ViT-B with a patch size of 4 for 1000 epochs with a batch size of 128. We sweep over 10 logarithmically spaced learning rates from 1 × 10⁻⁴ to 3 × 10⁻³ and weight decay values from 0 to 0.05, selecting whichever yields the highest test accuracy. The learning rate is warmed up for 10 epochs and is annealed using cosine annealing as a function of total epochs.
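
The WideResNet recipe above can be translated into a short PyTorch sketch. This is a reconstruction from the quoted hyperparameters, not the authors' code; the helper name make_sgd_setup is an assumption.

```python
import numpy as np
import torch

# 10 logarithmically spaced learning-rate candidates between 0.01 and 0.25,
# mirroring the sweep described for the WideResNet runs.
lr_grid = np.logspace(np.log10(0.01), np.log10(0.25), num=10)

def make_sgd_setup(model, lr, epochs=500):
    """SGD with momentum 0.9 and weight decay 1e-4; the learning rate is divided
    by 10 at 30%, 50%, and 90% of training, per the quoted Appendix B.1 details."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=1e-4)
    milestones = [int(0.3 * epochs), int(0.5 * epochs), int(0.9 * epochs)]
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=milestones, gamma=0.1)
    return opt, sched
```

The ViT runs would instead pair torch.optim.Adam with a 10-epoch warmup followed by cosine annealing (e.g., torch.optim.lr_scheduler.CosineAnnealingLR); the exact warmup implementation is not specified in the paper, so it is omitted here.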