Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability

Authors: Zixuan Wang, Zhouzi Li, Jian Li

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically identify the norm of output layer weight as an interesting indicator of the sharpness dynamics. Based on this empirical observation, we attempt to theoretically and empirically explain the dynamics of various key quantities that lead to the change of the sharpness in each phase of EOS. Moreover, based on certain assumptions, we provide a theoretical proof of the sharpness behavior in the EOS regime in two-layer fully-connected linear neural networks.
Researcher Affiliation | Academia | Zhouzi Li, IIIS, Tsinghua University (zhouzi188763@gmail.com); Zixuan Wang, IIIS, Tsinghua University (wangzx2019012326@gmail.com); Jian Li, IIIS, Tsinghua University (lapordge@gmail.com)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code (a specific repository link, an explicit code release statement, or code in supplementary materials) for the described methodology.
Open Datasets | Yes | As illustrated in Figure 1, we train a shallow neural network by gradient descent on a subset of 1,000 samples from CIFAR-10 (Krizhevsky et al. [17]), using the MSE loss as the objective.
Dataset Splits | No | The paper mentions using a 'subset of 1,000 samples from CIFAR-10' but does not provide specific dataset split information (exact percentages, sample counts, or splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used to run its experiments.
Software Dependencies | No | The paper mentions PyTorch in its bibliography but does not give the ancillary software details (library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | No | The paper describes training a 'shallow neural network' using 'gradient descent' and 'MSE loss' on a 'subset of 1,000 samples from CIFAR-10', but it does not report concrete hyperparameter values (e.g., learning rate, batch size, number of epochs), optimizer settings, or other training configuration details. A hedged sketch of such a setup, under assumed hyperparameters, follows this table.
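Because the paper does not report these values, the following is only a minimal sketch under stated assumptions, not the authors' code: it trains a shallow fully-connected network (assumed two-layer, width 200, ReLU) on the first 1,000 CIFAR-10 training images with full-batch gradient descent and MSE loss, and tracks the sharpness (top Hessian eigenvalue) by power iteration on Hessian-vector products. The learning rate, width, activation, and step count below are illustrative assumptions.

```python
# Hedged sketch (not the authors' implementation) of the setup described above:
# shallow fully-connected net, full-batch GD, MSE loss, 1,000 CIFAR-10 samples,
# with a power-iteration estimate of the sharpness along the trajectory.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

# Subset of 1,000 CIFAR-10 training samples, flattened; one-hot targets for MSE.
data = datasets.CIFAR10("./data", train=True, download=True,
                        transform=transforms.ToTensor())
X = torch.stack([data[i][0].flatten() for i in range(1000)]).to(device)
Y = F.one_hot(torch.tensor([data[i][1] for i in range(1000)]), 10).float().to(device)

# Assumed architecture: two-layer fully-connected network of width 200 with ReLU.
model = nn.Sequential(nn.Linear(3072, 200), nn.ReLU(), nn.Linear(200, 10)).to(device)
params = [p for p in model.parameters() if p.requires_grad]

def loss_fn():
    return F.mse_loss(model(X), Y)

def sharpness(n_iter=20):
    """Estimate the top Hessian eigenvalue via power iteration on Hessian-vector products."""
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    vnorm = torch.sqrt(sum((u ** 2).sum() for u in v))
    v = [u / vnorm for u in v]
    eig = 0.0
    for _ in range(n_iter):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient v^T H v
        hnorm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / hnorm for h in hv]
    return eig

lr = 0.01                    # assumed learning rate (not reported in the paper)
for step in range(2000):     # assumed number of full-batch GD steps
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g
    if step % 100 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}  sharpness ~ {sharpness():.1f}")
```

Under this kind of setup, one expects the logged sharpness to rise during progressive sharpening and then hover around 2 / lr once training enters the edge-of-stability regime.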