Are All Losses Created Equal: A Neural Collapse Perspective

Authors: Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, Zhihui Zhu

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments further show that NC features obtained from all relevant losses (i.e., CE, LS, FL, MSE) lead to largely identical performance on test data as well, provided that the network is sufficiently large and trained until convergence. ... We also provide an experimental verification of this claim through experiments in Section 4.1.
Researcher Affiliation | Collaboration | Jinxin Zhou (Ohio State University, zhou.3820@osu.edu); Chong You (Google Research, cyou@google.com); Xiao Li (University of Michigan, xlxiao@umich.edu); Kangning Liu (New York University, kl3141@nyu.edu); Sheng Liu (New York University, shengliu@nyu.edu); Qing Qu (University of Michigan, qingqu@umich.edu); Zhihui Zhu (Ohio State University, zhu.3440@osu.edu)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code is available at https://github.com/jinxinzhou/nc_loss.
Open Datasets | Yes | We train a WideResNet50 network [45] on CIFAR10 and CIFAR100 datasets [46] and a WideResNet18 network on mini-ImageNet [47] with various widths and numbers of iterations for image classification using these four different losses. (A minimal sketch of the four losses appears after the table.)
Dataset Splits | Yes | The test accuracy is reported based on the model with the best accuracy on the validation set, where we organize the validation set by holding out 10 percent of the data from the training set. (A sketch of this 90/10 hold-out split appears after the table.)
Hardware Specification | No | The provided text does not explicitly describe the specific hardware used to run the experiments (e.g., GPU models, CPU types, or cloud instance details). Appendix A is referenced for this information but not provided in the text.
Software Dependencies | No | The provided text does not specify software dependencies with version numbers.
Experiment Setup | Yes | For optimization, we use SGD with momentum 0.9 and an initial learning rate 0.1 decayed by a factor of 0.1 at 3/7 of the total number of iterations. Following [28], the norm of the gradient is clipped at 2, which can improve performance for all losses. For CIFAR10 and mini-ImageNet, the weight decay is set to 5e-4 for all configurations with all losses. For CIFAR100, the weight decay is fine-tuned to achieve the best accuracy for every configuration and loss. (A sketch of this optimization setup appears after the table.)
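
The four losses referenced throughout the table (CE, LS, FL, MSE) could be instantiated roughly as follows. This is a minimal PyTorch sketch, not the authors' implementation from the nc_loss repository; the smoothing factor, the focal-loss exponent `gamma`, and the choice of one-hot MSE targets are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_losses(num_classes, smoothing=0.1, gamma=2.0):
    """Return the four classification losses compared in the paper.
    The hyperparameters (smoothing, gamma) are illustrative assumptions."""

    def ce(logits, targets):
        # Standard cross-entropy (CE).
        return F.cross_entropy(logits, targets)

    def ls(logits, targets):
        # Cross-entropy with label smoothing (LS).
        return F.cross_entropy(logits, targets, label_smoothing=smoothing)

    def fl(logits, targets):
        # Focal loss (FL): down-weights already well-classified examples.
        log_p = F.log_softmax(logits, dim=1)
        log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)
        pt = log_pt.exp()
        return (-((1 - pt) ** gamma) * log_pt).mean()

    def mse(logits, targets):
        # Mean-squared error against one-hot targets (MSE).
        one_hot = F.one_hot(targets, num_classes).float()
        return F.mse_loss(logits, one_hot)

    return {"CE": ce, "LS": ls, "FL": fl, "MSE": mse}
```

Per the quoted claim in the Research Type row, the paper reports that all four losses yield largely identical test performance once the network is sufficiently large and trained to convergence.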
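The 90/10 hold-out described in the Dataset Splits row corresponds to a standard random split of the training set; the use of torchvision's CIFAR10 loader, `torch.utils.data.random_split`, and the fixed seed below are assumptions made for illustration.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# Hold out 10 percent of the CIFAR-10 training set as a validation set.
full_train = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = len(full_train) // 10            # 10 percent held out for validation
n_train = len(full_train) - n_val
train_set, val_set = random_split(
    full_train, [n_train, n_val],
    generator=torch.Generator().manual_seed(0))  # assumed seed, for reproducibility
```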
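The settings quoted in the Experiment Setup row (SGD with momentum 0.9, initial learning rate 0.1 decayed by a factor of 0.1 at 3/7 of the total iterations, gradient-norm clipping at 2, weight decay 5e-4) map roughly onto the PyTorch loop below. The stand-in model, synthetic data, per-iteration scheduler stepping, and total iteration count are assumptions; the paper trains WideResNet models for far more iterations.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Minimal stand-ins so the sketch runs; the real experiments use WideResNet
# models on CIFAR / mini-ImageNet (these stand-ins are illustrative only).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,))),
    batch_size=64, shuffle=True)
criterion = nn.CrossEntropyLoss()

total_iters = 700                        # assumed; far fewer than in the paper
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# Decay the learning rate by a factor of 0.1 at 3/7 of the total iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(3 * total_iters / 7)], gamma=0.1)

step = 0
while step < total_iters:
    for images, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), targets)
        loss.backward()
        # Clip the gradient norm at 2, which the paper reports helps all losses.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=2.0)
        optimizer.step()
        scheduler.step()
        step += 1
        if step >= total_iters:
            break
```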