High-dimensional SGD aligns with emerging outlier eigenspaces
Authors: Gerard Ben Arous, Reza Gheissari, Jiaoyang Huang, Aukosh Jagannath
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We rigorously study the joint evolution of training dynamics via stochastic gradient descent (SGD) and the spectra of empirical Hessian and gradient matrices. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either 1 or 2-layer neural networks, the SGD trajectory rapidly aligns with emerging low-rank outlier eigenspaces of the Hessian and gradient matrices. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training, and exhibiting rank deficiency when the SGD converges to sub-optimal classifiers. This establishes some of the rich predictions that have arisen from extensive numerical studies in the last decade about the spectra of Hessian and information matrices over the course of training in overparametrized networks. (A sketch of how such eigenspace alignment can be measured numerically appears after the table.) |
| Researcher Affiliation | Academia | Gérard Ben Arous, Courant Institute, New York University, New York, NY, USA, benarous@cims.nyu.edu; Reza Gheissari, Department of Mathematics, Northwestern University, Evanston, IL, USA, gheissari@northwestern.edu; Jiaoyang Huang, Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA, USA, huangjy@wharton.upenn.edu; Aukosh Jagannath, Department of Statistics and Actuarial Science, Department of Applied Mathematics, and Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, a.jagannath@uwaterloo.ca |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statements or links regarding the release of open-source code for the described methodology. |
| Open Datasets | No | Let $\mathcal{C} = [k]$ be the collection of classes, with corresponding distinct class means $(\mu_a)_{a \in [k]} \subset \mathbb{R}^d$, covariance matrices $I_d/\lambda$, where $\lambda > 0$ can be viewed as a signal-to-noise parameter, and corresponding probabilities $0 < (p_a)_{a \in [k]} < 1$ such that $\sum_{a \in [k]} p_a = 1$. ... We imagine we have two data sets, a training set $(Y_\ell)_{\ell=1}^{M}$ and a test set $(\widetilde{Y}_\ell)_{\ell=1}^{\widetilde{M}}$, all drawn i.i.d. from $P_Y$. (The paper describes generating synthetic data from this Gaussian mixture model, rather than using a pre-existing, publicly available dataset with concrete access information or a formal citation. A minimal sampling sketch for this mixture appears after the table.) |
| Dataset Splits | No | We imagine we have two data sets, a training set $(Y_\ell)_{\ell=1}^{M}$ and a test set $(\widetilde{Y}_\ell)_{\ell=1}^{\widetilde{M}}$, all drawn i.i.d. from $P_Y$. ... $M \gtrsim d \log d$ rather than simply $M \gtrsim d$. (The paper mentions training and test sets but does not specify explicit numerical splits, such as percentages or counts, or refer to a standard split from a cited source.) |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments or simulations. |
| Software Dependencies | No | The paper does not list any specific software dependencies with version numbers. |
| Experiment Setup | Yes | The (online) stochastic gradient descent with initialization $x_0$ and learning rate, or step-size, $\delta$, will be run using the training set $(Y_\ell)_{\ell=1}^{M}$ as follows: $x_\ell = x_{\ell-1} - \delta\big(\nabla L(x_{\ell-1}, Y_\ell) + \beta x_{\ell-1}\big)$ (2.1). (A minimal sketch of this update rule appears after the table.) |
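
To make the data model quoted in the Open Datasets row concrete, the following is a minimal sketch (not code from the paper) of sampling a $k$-class Gaussian mixture with class means $\mu_a$, shared covariance $I_d/\lambda$, and class probabilities $p_a$. All dimensions and parameter values below are illustrative placeholders.

```python
import numpy as np

def sample_gmm(n, means, probs, lam, rng):
    """Draw n i.i.d. samples Y ~ P_Y from the k-class Gaussian mixture:
    class a is chosen with probability p_a, then Y ~ N(mu_a, I_d / lambda)."""
    k, d = means.shape
    labels = rng.choice(k, size=n, p=probs)              # class index a in [k]
    noise = rng.standard_normal((n, d)) / np.sqrt(lam)   # covariance I_d / lambda
    return means[labels] + noise, labels

# Placeholder instantiation (values are illustrative, not taken from the paper).
rng = np.random.default_rng(0)
d, k, lam = 500, 3, 4.0
means = rng.standard_normal((k, d))      # distinct class means mu_a
probs = np.full(k, 1.0 / k)              # p_a > 0 with sum_a p_a = 1
M, M_test = int(d * np.log(d)), 2000     # training size of order d log d
Y_train, a_train = sample_gmm(M, means, probs, lam, rng)      # training set (Y_l)
Y_test, a_test = sample_gmm(M_test, means, probs, lam, rng)   # test set
```

Choosing `M` of order `d * log(d)` echoes the sample-size remark quoted in the Dataset Splits row; any concrete constant in front of it is a modeling choice, not a value from the paper.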
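
The Experiment Setup row quotes the online SGD recursion (2.1). Below is a minimal sketch of that one-pass update, assuming the per-sample loss gradient is supplied by the caller; the toy quadratic loss, step size, and initialization are illustrative stand-ins, not the paper's losses or hyperparameters.

```python
import numpy as np

def online_sgd(x0, Y_train, grad_L, delta, beta):
    """One-pass (online) SGD, eq. (2.1): each sample Y_l is used exactly once,
    x_l = x_{l-1} - delta * (grad_x L(x_{l-1}, Y_l) + beta * x_{l-1})."""
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for Y in Y_train:
        x = x - delta * (grad_L(x, Y) + beta * x)
        trajectory.append(x.copy())
    return x, np.stack(trajectory)

# Toy usage with an illustrative quadratic loss L(x, Y) = ||x - Y||^2 / 2,
# whose gradient in x is (x - Y); this merely stands in for the paper's losses.
rng = np.random.default_rng(1)
samples = rng.standard_normal((1000, 50)) + 3.0
x_hat, path = online_sgd(np.zeros(50), samples, lambda x, Y: x - Y,
                         delta=0.05, beta=0.0)
```

The key point of the online setting is that every step consumes a fresh sample, so the number of SGD steps is bounded by the training-set size $M$.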
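
The alignment statement in the Research Type row, namely that SGD iterates align with the low-rank outlier eigenspace of empirical Hessian/gradient matrices, can be probed numerically. The sketch below is an assumption-laden illustration rather than the paper's procedure: it forms an empirical gradient second-moment matrix from a user-supplied `grad_L`, takes its top-$r$ eigenvectors, and reports how much of the iterate's norm lies in that span.

```python
import numpy as np

def outlier_alignment(x, Y_batch, grad_L, r):
    """Fraction of ||x||^2 captured by the span of the top-r eigenvectors
    of the empirical gradient second-moment matrix G = (1/n) sum_l g_l g_l^T."""
    grads = np.stack([grad_L(x, Y) for Y in Y_batch])  # (n, d) per-sample gradients
    G = grads.T @ grads / len(Y_batch)                 # empirical d x d matrix
    eigvals, eigvecs = np.linalg.eigh(G)               # eigenvalues in ascending order
    top = eigvecs[:, -r:]                              # candidate outlier eigenspace, rank r
    proj = top.T @ x                                   # coordinates of x in that span
    return float(proj @ proj) / float(x @ x)
```

Tracking this ratio along the SGD trajectory (and doing so per layer in multi-layer models) is one natural way to visualize the alignment phenomenon the paper proves; the choice of the gradient second-moment matrix rather than the Hessian, and of the rank $r$, are assumptions made for this sketch.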