Pushing Boundaries: Mixup's Influence on Neural Collapse
Authors: Quinn LeBlanc Fisher, Haoming Meng, Vardan Papyan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our investigation (code), spanning various architectures and dataset pairs, reveals that mixup's last-layer activations predominantly converge to a distinctive configuration different than one might expect. ... We conduct an extensive empirical study focusing on the last-layer activations of mixup training data. Our study reveals that mixup induces a geometric configuration of last-layer activations across various datasets and models. ... The results of our extensive empirical investigation are presented in Figures 1, 3, 5, 10, and 12. These figures collectively illustrate a consistent identification of a unique last-layer configuration induced by mixup, observed across a diverse range of: Architectures: Our study incorporated the WideResNet-40-10 (Zagoruyko & Komodakis, 2017) and ViT-B (Dosovitskiy et al., 2021) architectures; Datasets: The datasets employed included FashionMNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009); Optimizers: We used stochastic gradient descent (SGD), Adam (Kingma & Ba, 2017), and AdamW (Loshchilov & Hutter, 2017) as optimizers. The networks trained showed good generalization performance and calibration, as substantiated by the data presented in Tables 1 and 2. (A minimal sketch of standard mixup training is shown after this table.) |
| Researcher Affiliation | Academia | Quinn LeBlanc Fisher, Haoming Meng, Vardan Papyan (University of Toronto) |
| Pseudocode | No | The paper includes mathematical derivations and descriptions of algorithms (like the projection method in Appendix B.2), but it does not present any formal pseudocode or algorithm blocks. |
| Open Source Code | No | The abstract mentions "Our investigation (code)", but there is no explicit statement of code release, no link to a repository, and no mention of code in supplementary materials within the provided PDF. |
| Open Datasets | Yes | We consider the FashionMNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009) datasets. |
| Dataset Splits | No | The paper discusses training and testing, but does not explicitly specify the training/validation/test dataset splits (e.g., percentages or sample counts for a distinct validation set) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper acknowledges support from Compute Ontario and Compute Canada, implying computational resources were used, but does not provide specific details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper mentions using specific optimizers like SGD, Adam, and AdamW, but does not provide specific version numbers for any software dependencies (e.g., Python, PyTorch, TensorFlow, etc.). |
| Experiment Setup | Yes | Hyperparameter details are outlined in Appendix B.1. ... For the WideResNet experiments, we minimize the mixup loss using stochastic gradient descent (SGD) with momentum 0.9 and weight decay 1 × 10⁻⁴. All datasets are trained on a WideResNet-40-10 for 500 epochs with a batch size of 128. We sweep over 10 logarithmically spaced learning rates between 0.01 and 0.25, picking whichever results in the highest test accuracy. The learning rate is annealed by a factor of 10 at 30%, 50%, and 90% of the total training time. For the ViT experiments, we minimize the mixup loss using Adam optimization (Kingma & Ba, 2017). For each dataset we train a ViT-B with a patch size of 4 for 1000 epochs with a batch size of 128. We sweep over 10 logarithmically spaced learning rates from 1 × 10⁻⁴ to 3 × 10⁻³ and weight decay values from 0 to 0.05, selecting whichever yields the highest test accuracy. The learning rate is warmed up for 10 epochs and is annealed using cosine annealing as a function of total epochs. (A hedged configuration sketch based on these reported hyperparameters follows the table.) |
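
The paper studies standard mixup training (Zhang et al., 2018) but does not release code, so the following is only a minimal PyTorch-style sketch of the mixup loss being minimized. The function names and the `alpha` value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Beta

def mixup_batch(x, y, alpha=1.0):
    """Standard mixup: blend the batch with a randomly shuffled copy of itself."""
    lam = Beta(alpha, alpha).sample().item()      # mixing coefficient lambda ~ Beta(alpha, alpha)
    perm = torch.randperm(x.size(0))              # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]
    return x_mix, y, y[perm], lam

def mixup_loss(logits, y_a, y_b, lam):
    """Convex combination of the cross-entropy losses on the two label sets."""
    return lam * F.cross_entropy(logits, y_a) + (1.0 - lam) * F.cross_entropy(logits, y_b)
```

During training, the mixed batch is passed through the network and `mixup_loss` is minimized in place of plain cross-entropy; the paper's analysis then examines the last-layer activations of these mixed inputs.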
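Likewise, the WideResNet-40-10 hyperparameters quoted above (SGD with momentum 0.9, weight decay 1 × 10⁻⁴, 500 epochs, batch size 128, a sweep over 10 log-spaced learning rates in [0.01, 0.25], and annealing by 10× at 30%, 50%, and 90% of training) can be sketched as below; the helper name is hypothetical, and the model and data pipeline are omitted since they are not specified in code form.

```python
import numpy as np
import torch

EPOCHS, BATCH_SIZE = 500, 128

# Reported sweep: 10 logarithmically spaced learning rates between 0.01 and 0.25.
lr_grid = np.logspace(np.log10(0.01), np.log10(0.25), num=10)

def make_optimizer_and_scheduler(model, lr):
    # SGD with momentum 0.9 and weight decay 1e-4, per the quoted Appendix B.1 details.
    opt = torch.optim.SGD(model.parameters(), lr=lr,
                          momentum=0.9, weight_decay=1e-4)
    # Learning rate annealed by a factor of 10 at 30%, 50%, and 90% of total training.
    milestones = [int(EPOCHS * f) for f in (0.3, 0.5, 0.9)]
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=milestones, gamma=0.1)
    return opt, sched
```

The ViT-B runs would differ as quoted: Adam, 1000 epochs, a 10-epoch warmup, cosine annealing, and a joint sweep over learning rate and weight decay.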