Second-order regression models exhibit progressive sharpening to the edge of stability

Authors: Atish Agarwala, Fabian Pedregosa, Jeffrey Pennington

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we conduct a numerical analysis on the properties of a real neural network and use tools from our theoretical analysis to show that edge-of-stability behavior in the wild shows some of the same patterns as the theoretical models." "We conducted numerical experiments in real world models, and compare the behavior to our theory on simplified models." "Following (Cohen et al., 2022a), we trained a 2-hidden-layer tanh network using the squared loss on 5000 examples from CIFAR10 with learning rate 10^-2, a setting which shows edge-of-stability behavior."
Researcher Affiliation | Industry | "Google DeepMind. Correspondence to: Atish Agarwala <thetish@google.com>."
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using the Neural Tangents library but does not provide concrete access to source code for the methodology it describes.
Open Datasets | Yes | "Following (Cohen et al., 2022a), we trained a 2-hidden-layer tanh network using the squared loss on 5000 examples from CIFAR10 with learning rate 10^-2, a setting which shows edge-of-stability behavior."
Dataset Splits | No | The paper mentions using "5000 examples from CIFAR10" but does not specify the training, validation, or test splits needed to reproduce the experiment.
Hardware Specification | No | The paper states "All experiments were conducted on GPU with float32 precision" but does not provide specific hardware details such as exact GPU/CPU models or processor types.
Software Dependencies | No | The paper mentions using the Neural Tangents library but does not provide version numbers for it or for any other software dependency (see the model-definition sketch after this table).
Experiment Setup | Yes | "we trained a 2-hidden-layer tanh network using the squared loss on 5000 examples from CIFAR10 with learning rate 10^-2" (Section 5); "Models were 2-hidden-layer fully-connected networks, with hidden width 256 and Erf non-linearities. Models were initialized with the NTK parameterization, with weight variance 1 and bias variance 0... A learning rate of 0.003204 was used in all experiments. All plots were made using float-64 precision." (Appendix D.3). Hedged code sketches of this setup follow the table below.
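
To make the quoted setup concrete, here is a minimal model-definition sketch using the Neural Tangents stax API. The paper names the library but not this exact construction, so the specific calls, the flattened 3072-dimensional input, and the use of b_std=0.0 to express "bias variance 0" are assumptions, not the authors' code.

```python
# Minimal sketch (not the authors' code): a 2-hidden-layer, width-256
# Erf network with NTK parameterization, weight variance 1 and bias
# variance 0, built with the Neural Tangents stax API.
import jax
from neural_tangents import stax

init_fn, apply_fn, _ = stax.serial(
    stax.Dense(256, W_std=1.0, b_std=0.0, parameterization="ntk"),
    stax.Erf(),
    stax.Dense(256, W_std=1.0, b_std=0.0, parameterization="ntk"),
    stax.Erf(),
    stax.Dense(10, W_std=1.0, b_std=0.0, parameterization="ntk"),  # 10 CIFAR-10 classes
)

key = jax.random.PRNGKey(0)
# Assumption: CIFAR-10 images are flattened to 3072-dimensional inputs.
_, params = init_fn(key, input_shape=(-1, 3072))
```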
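A full-batch gradient-descent loop matching the quoted Section 5 setting (squared loss, 5000 CIFAR-10 examples, learning rate 10^-2), reusing apply_fn and params from the sketch above. The tensorflow_datasets loading path, pixel normalization, one-hot targets, the 1/2 factor in the loss, and the step count are all assumptions; the paper does not state them.

```python
# Sketch of the quoted training run; data-pipeline details are assumed.
import jax
import jax.numpy as jnp
import tensorflow_datasets as tfds

# 5000 CIFAR-10 training examples, as quoted from the paper.
ds = tfds.as_numpy(tfds.load("cifar10", split="train[:5000]", batch_size=-1))
x = jnp.asarray(ds["image"], jnp.float32).reshape(5000, -1) / 255.0
y = jax.nn.one_hot(jnp.asarray(ds["label"]), 10)

def mse_loss(params, x, y):
    # Squared loss; the 1/2 factor and mean normalization are assumptions.
    return 0.5 * jnp.mean((apply_fn(params, x) - y) ** 2)

lr = 1e-2  # learning rate quoted in Section 5

@jax.jit
def step(params):
    # One full-batch gradient-descent update.
    loss, grads = jax.value_and_grad(mse_loss)(params, x, y)
    new_params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return new_params, loss

for t in range(10_000):  # step count is an arbitrary choice
    params, loss = step(params)
```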
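The hallmark of edge-of-stability behavior is that the sharpness (top eigenvalue of the loss Hessian) rises during training until it hovers near 2/lr. One standard way to track it is power iteration with Hessian-vector products; the measurement below is our addition for illustration, not the paper's code, and the iteration count is arbitrary.

```python
# Sketch: estimate the sharpness of mse_loss at the current params.
import jax
import jax.numpy as jnp

def hvp(params, v):
    # Hessian-vector product via forward-over-reverse autodiff.
    grad_fn = jax.grad(lambda p: mse_loss(p, x, y))
    return jax.jvp(grad_fn, (params,), (v,))[1]

def sharpness(params, key, n_iter=50):
    # Power iteration for the top Hessian eigenvalue; n_iter is arbitrary.
    leaves, treedef = jax.tree_util.tree_flatten(params)
    keys = jax.random.split(key, len(leaves))
    v = treedef.unflatten(
        [jax.random.normal(k, p.shape) for k, p in zip(keys, leaves)]
    )
    for _ in range(n_iter):
        v = hvp(params, v)
        norm = jnp.sqrt(sum(jnp.vdot(u, u) for u in jax.tree_util.tree_leaves(v)))
        v = jax.tree_util.tree_map(lambda u: u / norm, v)
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    Hv = hvp(params, v)
    return sum(jnp.vdot(u, w) for u, w in zip(jax.tree_util.tree_leaves(v),
                                              jax.tree_util.tree_leaves(Hv)))

# At the edge of stability, sharpness(params, jax.random.PRNGKey(1))
# stays near 2 / lr rather than growing without bound.
```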