Investigating Generalization by Controlling Normalized Margin

Authors: Alexander R Farhang, Jeremy D Bernstein, Kushal Tirumala, Yang Liu, Yisong Yue

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper designs a series of experimental studies that explicitly control normalized margin and thereby tackle two central questions.
Researcher Affiliation | Collaboration | Alexander R. Farhang (Caltech), Jeremy Bernstein (Caltech), Kushal Tirumala (Caltech), Yang Liu (Argo AI), Yisong Yue (Caltech, Argo AI).
Pseudocode | Yes | Recipe 1: Controlling Frobenius-normalized margin γ_F. The recipe targets γ_F(x_i, y_i; w) = α_i across training points {(x_i, y_i)}_{i=1}^n for an L-layer MLP f^L(x; w). (A hedged sketch of margin control in this spirit follows the table.)
Open Source Code | Yes | Code available at: https://github.com/alexfarhang/margin.
Open Datasets | Yes | Two sets of experiments were performed, each of which trained two MLPs on 1000-point subsets of MNIST to classify either true or randomly labeled data for 10-class classification. For MNIST 0 vs. 1 classification, the training set size was 12665 and the test set size was 2115. For CIFAR-10 dog vs. ship, the training set size was 10000 and the test set size was 2000.
Dataset Splits | No | For MNIST 0 vs. 1 classification, the training set size was 12665 and the test set size was 2115. For MNIST 4 vs. 7 classification, the training set size was 12107 and the test set size was 2010. For MNIST 3 vs. 8 classification, the training set size was 11982 and the test set size was 1984. For CIFAR-10 dog vs. ship, the training set size was 10000 and the test set size was 2000.
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory used for running the experiments.
Software Dependencies | No | This paper employs the Nero optimizer (Liu et al., 2021)...
Experiment Setup | Yes | Depth-5, width-5000 fully connected neural networks were trained for 10-class classification on subsets of 1000 training points from MNIST... Rectified Linear Unit (ReLU) activations were used throughout all experiments. ...trained with a label-scaled squared loss function... full-batch gradient descent with a learning rate of 0.01 and an exponential learning rate decay of 0.999... trained with Frobenius control using the Nero optimizer (learning rate: 0.01, Nero β: 0.999)... 2-layer MLPs were trained for 10-class classification on 1000-point subsets of MNIST. ...Networks were trained for between 50,000 and 250,000 epochs (learning rates between 0.9998 and 0.999998). (A hedged sketch of this setup follows the table.)
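
As a concrete reading of the Pseudocode row, the sketch below computes a Frobenius-normalized margin and penalizes its deviation from per-example targets α_i. It assumes the common definition of normalized margin as the output margin divided by the product of per-layer Frobenius norms; the names (`frobenius_normalized_margin`, `margin_control_loss`, `alpha`) and the direct squared penalty are illustrative assumptions, not the paper's Recipe 1, which instead combines a label-scaled squared loss with the Nero optimizer.

```python
# Hedged sketch: Frobenius-normalized margin and a loss that drives it toward
# per-example targets alpha_i. Assumes margin = (true-class logit minus best
# other logit) / product of per-layer Frobenius norms; naming is illustrative
# and may differ from the authors' implementation.
import torch
import torch.nn as nn

def frobenius_normalized_margin(model: nn.Sequential, x, y):
    """Per-example output margin divided by the product of layer Frobenius norms."""
    logits = model(x)                                    # (batch, classes)
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)   # logit of the true class
    other = logits.clone()
    other.scatter_(1, y.unsqueeze(1), float("-inf"))     # mask out the true class
    margin = true - other.max(dim=1).values              # unnormalized margin
    norm_prod = torch.ones((), device=x.device)
    for layer in model:
        if isinstance(layer, nn.Linear):
            norm_prod = norm_prod * layer.weight.norm(p="fro")
    return margin / norm_prod

def margin_control_loss(model, x, y, alpha):
    """Squared deviation of the normalized margin from its target alpha_i."""
    gamma = frobenius_normalized_margin(model, x, y)
    return ((gamma - alpha) ** 2).mean()
```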
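
To make the Experiment Setup row concrete, the following sketch builds a depth-5, width-5000 ReLU MLP and trains it with a label-scaled squared loss under full-batch gradient descent (learning rate 0.01, exponential decay 0.999), as quoted above. Data handling, the `label_scale` value, and plain SGD in place of the authors' Nero optimizer are illustrative assumptions, not the authors' code.

```python
# Hedged sketch of the quoted setup: depth-5, width-5000 ReLU MLP, label-scaled
# squared loss, full-batch gradient descent with lr 0.01 and exponential decay
# 0.999, on a 1000-point MNIST subset. Data loading and label_scale are assumed.
import torch
import torch.nn as nn

def make_mlp(depth=5, width=5000, in_dim=28 * 28, num_classes=10):
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, width), nn.ReLU()]
        dim = width
    layers.append(nn.Linear(dim, num_classes))
    return nn.Sequential(*layers)

def train(x, y, label_scale=1.0, epochs=1000):
    """x: (1000, 784) flattened inputs, y: (1000,) integer labels; full batch."""
    model = make_mlp()
    targets = label_scale * nn.functional.one_hot(y, 10).float()  # label-scaled targets
    opt = torch.optim.SGD(model.parameters(), lr=0.01)            # plain gradient descent
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999)
    for _ in range(epochs):
        opt.zero_grad()
        loss = ((model(x) - targets) ** 2).mean()                 # squared loss
        loss.backward()
        opt.step()
        sched.step()
    return model
```

Swapping SGD for the Nero optimizer (from the authors' repository) with learning rate 0.01 and β = 0.999 would move this sketch closer to the Frobenius-control runs described in the quoted setup.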