Extended Unconstrained Features Model for Exploring Deep Neural Collapse

Authors: Tom Tirer, Joan Bruna

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we empirically demonstrate the usefulness of our nonlinear extended UFM in modeling the NC phenomenon that occurs with practical networks.
Researcher Affiliation | Academia | 1) Center for Data Science, New York University, New York; 2) Courant Institute of Mathematical Sciences, New York University, New York. Correspondence to: Tom Tirer <tirer.tom@gmail.com>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | Finally, we show the similarity of the NC metrics that are obtained for the nonlinear extended UFM in Figure 4 (rather than those in Figure 3) and metrics obtained by a practical well-trained DNN, namely ResNet18 (He et al., 2016) (composed of 4 ResBlocks), trained on MNIST with SGD with learning rate 0.05 (divided by 10 every 40 epochs) and weight decay (L2 regularization) of 5e-4. Figure 5 shows the results for two cases: 1) MSE loss without bias in the FC layer; and 2) the widely-used setting, with cross-entropy loss and bias. (Additional experiments with the CIFAR10 dataset appear in Appendix G.) A hedged sketch of this training recipe appears after the table.
Dataset Splits | No | The paper mentions training on MNIST and CIFAR10 but does not specify the train/validation/test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions ResNet18 but does not provide specific software dependencies with version numbers (e.g., library names, framework versions).
Experiment Setup | Yes | Figure 1 corroborates Theorem 3.1 for K = 4, d = 20, n = 50 and λW = λH = 0.005 (no bias is used, equivalently λb ). Both W and H are initialized with standard normal distribution and are optimized with plain gradient descent with step-size 0.1. ... trained on MNIST with SGD with learning rate 0.05 (divided by 10 every 40 epochs) and weight decay (L2 regularization) of 5e-4. Hedged code sketches of both setups appear below the table.
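
To make the Experiment Setup row concrete, here is a minimal sketch of the quoted UFM gradient-descent experiment (K = 4, d = 20, n = 50, λW = λH = 0.005, standard-normal initialization, plain gradient descent with step size 0.1), assuming PyTorch, one-hot targets, and a bias-free regularized MSE objective; the 1/(2N) loss scaling and the iteration count are assumptions rather than details quoted from the paper.

```python
import torch

# Dimensions and hyperparameters quoted in the Experiment Setup row.
K, d, n = 4, 20, 50            # classes, feature dimension, samples per class
N = K * n
lam_W = lam_H = 0.005          # regularization weights for W and H
lr = 0.1                       # step size quoted in the paper
steps = 50_000                 # iteration count: an assumption, not quoted

# One-hot targets Y (K x N), one block of n columns per class.
Y = torch.eye(K).repeat_interleave(n, dim=1)

# W (classifier) and H (unconstrained features) initialized with a
# standard normal distribution, as quoted.
W = torch.randn(K, d, requires_grad=True)
H = torch.randn(d, N, requires_grad=True)

optimizer = torch.optim.SGD([W, H], lr=lr)   # plain full-batch gradient descent

for _ in range(steps):
    optimizer.zero_grad()
    # Bias-free regularized MSE objective; the 1/(2N) normalization is assumed.
    loss = (0.5 / N) * (W @ H - Y).pow(2).sum() \
           + 0.5 * lam_W * W.pow(2).sum() \
           + 0.5 * lam_H * H.pow(2).sum()
    loss.backward()
    optimizer.step()
```

After convergence, NC metrics (e.g., within-class variability of the columns of H) can be computed directly from W and H, since the features are free optimization variables rather than outputs of a backbone network.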
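
Likewise, a hedged sketch of the ResNet18-on-MNIST recipe quoted in the Open Datasets and Experiment Setup rows (SGD, learning rate 0.05 divided by 10 every 40 epochs, weight decay 5e-4, cross-entropy loss with bias); the torchvision model, the grayscale-to-RGB adaptation, the batch size, and the epoch budget are illustrative assumptions, since the excerpt does not specify them.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# MNIST is grayscale; replicating to 3 channels lets a stock ResNet18 accept it
# (the paper's exact architecture adaptation is not specified in the excerpt).
tfm = transforms.Compose([transforms.Grayscale(num_output_channels=3),
                          transforms.ToTensor()])
train_set = datasets.MNIST("data", train=True, download=True, transform=tfm)
loader = DataLoader(train_set, batch_size=128, shuffle=True)

model = models.resnet18(num_classes=10).to(device)
criterion = nn.CrossEntropyLoss()          # the cross-entropy (with bias) setting
# Momentum is not mentioned in the excerpt, so it is omitted here.
optimizer = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=5e-4)
# Divide the learning rate by 10 every 40 epochs, as quoted.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=40, gamma=0.1)

for epoch in range(120):                   # epoch budget: an assumption
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```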