Understanding Edge-of-Stability Training Dynamics with a Minimalist Example

Authors: Xingyu Zhu, Zixuan Wang, Xiang Wang, Mo Zhou, Rong Ge

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study the EoS phenomenon by constructing a simple function that has the same behavior. We give rigorous analysis for its training dynamics in a large local region and explain why the final converging point has sharpness close to 2/η. Globally we observe that the training dynamics for our example have an interesting bifurcating behavior, which was also observed in the training of neural nets. ... Finally, in Section 6 we present the similarity between our minimalist model and the GD trajectory of some over-parameterized deep neural networks trained on a real-world dataset.
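The quoted abstract centers on tracking the sharpness (the top Hessian eigenvalue) along a GD trajectory and comparing it with the threshold 2/η. The sketch below is not the paper's minimalist construction; it uses a hypothetical two-parameter loss L(a, b) = (ab − 1)²/2 only to illustrate how such monitoring is done. With the small step size chosen here the toy stays below 2/η; in the EoS regime described in the excerpt, the sharpness instead settles just below 2/η.

```python
import numpy as np

# Hypothetical illustration (not the paper's minimalist example): monitor the
# sharpness, i.e. the largest Hessian eigenvalue, of L(a, b) = (a*b - 1)^2 / 2
# along a gradient descent trajectory and compare it with 2/eta.
def grad_and_hess(a, b):
    r = a * b - 1.0
    grad = np.array([r * b, r * a])
    hess = np.array([[b * b, 2.0 * a * b - 1.0],
                     [2.0 * a * b - 1.0, a * a]])
    return grad, hess

eta = 0.05                                    # small enough to stay stable here
theta = np.array([2.5, 0.1])                  # arbitrary initialization
for step in range(2000):
    grad, hess = grad_and_hess(*theta)
    sharpness = np.linalg.eigvalsh(hess)[-1]  # largest Hessian eigenvalue
    theta = theta - eta * grad                # plain gradient descent step

loss = 0.5 * (theta[0] * theta[1] - 1.0) ** 2
print(f"loss {loss:.2e}, sharpness {sharpness:.2f}, 2/eta {2.0 / eta:.1f}")
```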
Researcher Affiliation | Academia | Xingyu Zhu¹, Zixuan Wang², Xiang Wang¹, Mo Zhou¹, Rong Ge¹; ¹Duke University, ²Tsinghua University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links regarding the availability of source code for the described methodology.
Open Datasets | Yes | We train a 5-layer ELU-activated fully connected network on a 2-class small subset of CIFAR-10 (Krizhevsky et al., 2009)...
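The excerpt only names CIFAR-10, two classes, and (in the next row) n = 50. The sketch below shows one way such a subset could be built; the class pair, per-class counts, and ±1 regression labels are assumptions not fixed by the excerpt.

```python
import torch
from torchvision import datasets, transforms

# Hypothetical construction of a 2-class, 50-example CIFAR-10 subset.
# Class pair (0 vs. 1), 25 examples per class, and +/-1 labels are assumptions.
dataset = datasets.CIFAR10(root="./data", train=True, download=True,
                           transform=transforms.ToTensor())
targets = torch.tensor(dataset.targets)
idx = torch.cat([torch.where(targets == 0)[0][:25],
                 torch.where(targets == 1)[0][:25]])
X = torch.stack([dataset[i][0].flatten() for i in idx.tolist()])   # (50, 3072)
y = torch.where(targets[idx] == 0, torch.tensor(-1.0), torch.tensor(1.0))
```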
Dataset Splits | No | For the CIFAR-10 subset experiment {(x_i, y_i)}_{i=1}^n where n = 50, x_i ∈ R^3072, and y_i ∈ {−1, 1}, we consider the mean squared loss... The paper describes creating a 50-sample subset but does not specify any training, validation, or test splits for this data.
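In the notation of that excerpt, the mean squared loss over the 50-example subset takes the standard form below; the exact normalization constant is an assumption, since the excerpt does not fix it.

```latex
L(\theta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(f_\theta(x_i) - y_i\bigr)^2,
\qquad n = 50,\quad x_i \in \mathbb{R}^{3072},\quad y_i \in \{-1, 1\}.
```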
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | For neural networks, we use the PyHessian package by (Yao et al., 2020), which computes the top eigenvector-eigenvalue pair by evaluating Hessian-vector products and running power iteration. For all numerical sharpness computed for neural networks, we set tol=1e-6 and max_iter=10000. While PyHessian is mentioned, no version number for this or any other software dependency is provided.
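A minimal sketch of the sharpness computation quoted above, assuming PyHessian's hessian(...).eigenvalues(...) interface (the argument names maxIter and tol follow that interface); the tiny model and random data are placeholders standing in for the trained network and the CIFAR-10 subset.

```python
import torch
import torch.nn as nn
from pyhessian import hessian  # PyHessian (Yao et al., 2020)

# Sketch: top Hessian eigenvalue (sharpness) via power iteration on
# Hessian-vector products, with the tolerances quoted in the excerpt.
# Placeholder model and data; not the paper's trained network.
model = nn.Sequential(nn.Linear(3072, 64), nn.ELU(), nn.Linear(64, 1))
X = torch.randn(50, 3072)
y = torch.randint(0, 2, (50, 1)).float() * 2 - 1          # labels in {-1, +1}
criterion = nn.MSELoss()

hess = hessian(model, criterion, data=(X, y), cuda=False)
top_eigenvalues, _ = hess.eigenvalues(maxIter=10000, tol=1e-6, top_n=1)
print(f"sharpness = {top_eigenvalues[-1]:.4f}, 2/eta = {2 / 0.01:.1f}")
```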
Experiment Setup | Yes | We train a 5-layer ELU-activated fully connected network... with GD. The loss converges to 0 and the sharpness converges to just slightly below 2/η. ... We train the model using GD with η = 0.01 for 18500 iterations. ... We use Xavier initialization (Glorot & Bengio, 2010) with a gain of 1 to initialize all the weight matrices. ... For all numerical sharpness computed for neural networks, we set tol=1e-6 and max_iter=10000.
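A sketch of the described setup, assuming a scalar-output MLP and explicit full-batch GD updates. The excerpt fixes only depth (5 layers), ELU activations, Xavier initialization with gain 1, η = 0.01, and 18500 iterations; the hidden width, the normal variant of Xavier init, and the scalar output head are assumptions.

```python
import torch
import torch.nn as nn

def make_mlp(in_dim=3072, width=200, depth=5):
    # 5 linear layers with ELU activations; hidden width is an assumption.
    layers, dim = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(dim, width), nn.ELU()]
        dim = width
    layers.append(nn.Linear(dim, 1))               # scalar output for +/-1 targets
    model = nn.Sequential(*layers)
    for m in model.modules():                      # Xavier init, gain 1
        if isinstance(m, nn.Linear):
            nn.init.xavier_normal_(m.weight, gain=1.0)
            nn.init.zeros_(m.bias)
    return model

def train_full_batch_gd(model, X, y, eta=0.01, steps=18500):
    # Plain full-batch gradient descent, as quoted: eta = 0.01, 18500 steps.
    loss_fn = nn.MSELoss()
    for step in range(steps):
        loss = loss_fn(model(X).squeeze(-1), y)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= eta * p.grad
    return model
```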