Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability
Authors: Alex Damian, Eshaan Nichani, Jason D. Lee
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify that the predicted dynamics defined in eq. (5) accurately capture the dynamics of gradient descent at the edge of stability by replicating the experiments in (Cohen et al., 2021) and tracking the deviation of gradient descent from the constrained trajectory. In Figure 3, we evaluate our theory on a 3-layer MLP and a 3-layer CNN trained with mean squared error (MSE) on a 5k subset of CIFAR10 and a 2-layer Transformer (Vaswani et al., 2017) trained with MSE on SST2 (Socher et al., 2013). (A sketch of tracking the sharpness that signals instability appears after the table.) |
| Researcher Affiliation | Academia | Alex Damian*, Eshaan Nichani* & Jason D. Lee; Princeton University; {ad27,eshnich,jasonlee}@princeton.edu |
| Pseudocode | Yes | "Definition 6 (Predicted Dynamics, full). Define $v_0^\dagger = v_0$, and let $x_t^\dagger = v_t^\dagger \cdot u_t$, $y_t^\dagger = \nabla S_t^\top v_t^\dagger$. Then $v_{t+1}^\dagger = P^\perp_{u_{t+1}}(I - \eta \nabla^2 L_t)P^\perp_{u_t} v_t^\dagger + \eta P^\perp_{u_{t+1}} \nabla S_t - (1 + \eta y_t^\dagger)\,x_t^\dagger\,u_{t+1}$ (6)" and "Definition 7. Given a vector $v$ and a timestep $t$, define $\mathrm{step}_t(v)$ by $P^\perp_{u_{t+1}}\,\mathrm{step}_t(v) = P^\perp_{u_{t+1}}\big[(I - \eta \nabla^2 L_t)P^\perp_{u_t} v + \eta \nabla S_t\big]$, $u_{t+1}^\top \mathrm{step}_t(v) = -(1 + \eta y)\,x$. (8)" (A hedged sketch of this update appears after the table.) |
| Open Source Code | Yes | Our code can be found at https://github.com/adamian98/EOS. |
| Open Datasets | Yes | We evaluate our theory on a 3-layer MLP and a 3-layer CNN trained with mean squared error (MSE) on a 5k subset of CIFAR10 and a 2-layer Transformer (Vaswani et al., 2017) trained with MSE on SST2 (Socher et al., 2013). |
| Dataset Splits | No | We evaluate our theory on a 3-layer MLP and a 3-layer CNN trained with mean squared error (MSE) on a 5k subset of CIFAR10 and a 2-layer Transformer (Vaswani et al., 2017) trained with MSE on SST2 (Socher et al., 2013). (This gives training-set sizes but no explicit train/validation/test splits, e.g., percentages or counts.) |
| Hardware Specification | No | All experiments were conducted on two servers, each with 10 NVIDIA GPUs. (This gives the GPU vendor and count, but not the specific GPU model or other hardware details such as CPU or memory.) |
| Software Dependencies | No | Our experiments were conducted in JAX (Bradbury et al., 2018), using https://github.com/locuslab/edge-of-stability as a reference for replicating the experimental setup used in (Cohen et al., 2021). (JAX is named but no version numbers are given.) |
| Experiment Setup | Yes | "For every experiment, we tracked the gradient descent dynamics until they reached instability and then began tracking the constrained trajectory, gradient descent, gradient flow, and both our predicted dynamics (Section 5) and our generalized predicted dynamics (Appendix F)... we switched to computing gradients with 64-bit precision after first reaching instability to avoid propagating floating point errors." and "MLP+MSE on CIFAR10, η = 0.002" (from Figure 3 caption). (A sketch of the precision switch appears after the table.) |
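
Replicating the (Cohen et al., 2021) experiments requires tracking the sharpness, i.e. the largest eigenvalue of the training-loss Hessian, so that the step at which gradient descent first becomes unstable (sharpness exceeding 2/η) can be detected. The snippet below is a minimal, hypothetical sketch of one common way to do this in JAX via power iteration on Hessian-vector products; it is not taken from the authors' repository, and `loss_fn` (a scalar function of the parameter pytree, with the data closed over) and `params` are placeholder names.

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def sharpness(loss_fn, params, n_iters=20, seed=0):
    """Estimate the top Hessian eigenvalue of loss_fn at params (the sharpness)."""
    flat, unravel = ravel_pytree(params)
    # Flattened gradient as a function of the flattened parameter vector.
    grad_flat = lambda p: ravel_pytree(jax.grad(loss_fn)(unravel(p)))[0]

    def hvp(v):
        # Hessian-vector product via forward-over-reverse differentiation.
        return jax.jvp(grad_flat, (flat,), (v,))[1]

    # Power iteration on the Hessian, starting from a random direction.
    v = jax.random.normal(jax.random.PRNGKey(seed), flat.shape)
    for _ in range(n_iters):
        v = hvp(v)
        v = v / jnp.linalg.norm(v)
    return jnp.vdot(v, hvp(v))  # Rayleigh quotient ≈ largest eigenvalue
```

A run would then flag instability at the first step where `sharpness(...)` exceeds `2 / eta`.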
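
The Definition 6/7 update quoted in the Pseudocode row can be written as a single linear-algebra step. The sketch below is a hedged reading of eq. (6) as reconstructed above, not the authors' implementation: it assumes the Hessian `hess_L` (∇²L_t), the sharpness gradient `grad_S` (∇S_t), and the unit top eigenvectors `u_t`, `u_tp1` are supplied as dense arrays.

```python
import jax.numpy as jnp

def predicted_step(v, u_t, u_tp1, hess_L, grad_S, eta):
    """One step of the predicted deviation dynamics, v_t^† -> v_{t+1}^† (eq. 6)."""
    def proj_perp(u, w):
        # Project w onto the orthogonal complement of the unit vector u.
        return w - jnp.dot(u, w) * u

    x = jnp.dot(v, u_t)     # x_t^†: displacement along the top eigenvector
    y = jnp.dot(grad_S, v)  # y_t^†: first-order change in sharpness
    # Component orthogonal to u_{t+1}: linearized GD step plus the η∇S term.
    v_perp = proj_perp(u_t, v)
    ortho = proj_perp(u_tp1, v_perp - eta * (hess_L @ v_perp) + eta * grad_S)
    # Component along u_{t+1}: the period-2 oscillation -(1 + ηy_t^†) x_t^†.
    return ortho - (1.0 + eta * y) * x * u_tp1
```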
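
The Experiment Setup row notes that gradients were computed in 64-bit precision after instability is first reached. The snippet below is a minimal sketch of one way to do that in JAX; the `jax_enable_x64` flag is standard JAX configuration, but `loss_fn`, `params`, and `batch` are placeholder names rather than the authors' code.

```python
import jax
import jax.numpy as jnp

# Allow 64-bit arrays; without this flag JAX silently keeps computations in float32.
jax.config.update("jax_enable_x64", True)

def high_precision_grad(loss_fn, params, batch):
    """Gradient of loss_fn in float64, used once the run has first reached instability."""
    to64 = lambda x: x.astype(jnp.float64) if jnp.issubdtype(x.dtype, jnp.floating) else x
    params64 = jax.tree_util.tree_map(to64, params)
    batch64 = jax.tree_util.tree_map(to64, batch)
    return jax.grad(loss_fn)(params64, batch64)
```

In a training loop this would only be invoked after the sharpness first exceeds 2/η, with ordinary 32-bit gradients used before that point.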