High-dimensional limit theorems for SGD: Effective dynamics and critical scaling

Authors: Gérard Ben Arous, Reza Gheissari, Aukosh Jagannath

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate our approach on popular examples including estimation for spiked matrix and tensor models and classification via two-layer networks for binary and XOR-type Gaussian mixture models. These examples exhibit surprising phenomena, including multimodal timescales to convergence as well as convergence to sub-optimal solutions with probability bounded away from zero from random (e.g., Gaussian) initializations. Figure 1: Matrix PCA summary statistics in dim. n = 1500, run for 10n steps at λ = 0.8 < λ_c in (a)–(b) and λ = 1.2 > λ_c in (c)–(d); marked points indicate the stable fixed points of the systems. Panels (a) and (c) demonstrate the stable and unstable OU processes that arise as diffusive limits of the m variable, and (b) and (d) depict the trajectories in (m, r²). (A simulation in the spirit of Figure 1 is sketched after the table.) If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
Researcher Affiliation | Academia | Gérard Ben Arous, Department of Mathematics, Courant Institute, New York University, New York, NY, benarous@cims.nyu.edu; Reza Gheissari, Miller Institute, Departments of EECS and Statistics, University of California, Berkeley, Berkeley, CA, gheissari@berkeley.edu; Aukosh Jagannath, Department of Statistics and Actuarial Science and Department of Applied Mathematics, University of Waterloo, Waterloo, ON, a.jagannath@uwaterloo.ca
Pseudocode | No | The paper describes mathematical formulations and processes but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code | Yes | If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See supplemental material.
Open Datasets | No | The paper uses data generated from theoretical models (spiked matrix models, Gaussian mixture models) rather than established, publicly available datasets. It describes the data-generating process ('Y = (y, X), where y is a Ber(1/2) random variable and, conditionally on y, we have X ∼ N((2y − 1)µ, I/λ)'). No concrete access information (link, DOI, etc.) for an external, publicly available dataset is provided. (A sketch of this sampling process appears after the table.)
Dataset Splits | No | The paper describes experimental results based on theoretical models but does not specify training, validation, or test data splits, whether as percentages or sample counts, for any external dataset.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper does not list any specific software dependencies with version numbers (e.g., Python, PyTorch, specific solvers).
Experiment Setup | Yes | We consider online stochastic gradient descent with constant learning rate δ_n, which is given by X_ℓ = X_{ℓ−1} − δ_n ∇L_n(X_{ℓ−1}, Y_ℓ). The critical scaling for δ is then of order 1/n, and we obtain the following effective dynamics. Proposition 4.1. Let u_n be as in (4.2) and fix any λ > 0 and δ_n = c_δ/N. Then u_n(t) converges to the solution of the ODE system u̇_t = f(u_t) + g(u_t), initialized from lim_{n→∞} (u_n)_♯ µ_n, with: […] For large λ, this is indeed a quantitative approximation, as f, g exhibit locally Lipschitz dependence on λ^{−1}, so the corresponding dynamics converges as λ → ∞ by classical well-posedness results (see, e.g., [48]). Figure 3: XOR GMM in dim. N = 250 with λ = 1000 and c_δ = 0.1. (An online-SGD sketch follows the table.)
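
The matrix PCA experiment behind Figure 1 can be imitated with a short simulation. The sketch below is not the authors' code: the loss L(x, Y) = −⟨x, Yx⟩/2, the spherical normalization of x, and the reduced dimension n = 200 (the paper uses n = 1500) are illustrative assumptions; only the spiked-matrix form of the data, the 10n-step horizon, the δ_n = c_δ/n step size, and the overlap statistic m follow the paper's description.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200          # reduced from the paper's n = 1500 to keep the sketch fast
lam = 1.2        # signal-to-noise ratio; 1.2 > lambda_c as in panels (c)-(d)
c_delta = 1.0    # hypothetical step-size constant, delta_n = c_delta / n
delta = c_delta / n

v = rng.standard_normal(n)
v /= np.linalg.norm(v)                  # unit-norm spike direction

x = rng.standard_normal(n)
x *= np.sqrt(n) / np.linalg.norm(x)     # random initialization on the sphere of radius sqrt(n)

m_traj = []
for _ in range(10 * n):                 # 10n online steps, one fresh sample per step
    W = rng.standard_normal((n, n))
    W = (W + W.T) / np.sqrt(2 * n)      # symmetric noise matrix, entry variance 1/n
    Y = lam * np.outer(v, v) + W        # spiked-matrix sample
    x = x + delta * (Y @ x)             # descent step on the assumed loss -<x, Yx>/2
    x *= np.sqrt(n) / np.linalg.norm(x) # project back to |x| = sqrt(n)
    m_traj.append((v @ x) / np.sqrt(n)) # overlap summary statistic m

print(f"final overlap m = {m_traj[-1]:+.3f}")
```

Under these assumptions, runs with λ above threshold should see |m| drift toward a nonzero stable fixed point, while runs below threshold should fluctuate around zero, mirroring the stable and unstable behavior Figure 1 depicts.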
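The binary Gaussian mixture quoted under Open Datasets is fully specified by its sampling law, so the data can be regenerated directly. In the sketch below, the dimension, the unit-norm choice of µ, λ = 4, and the sample count are hypothetical; only the law y ~ Ber(1/2), X | y ~ N((2y − 1)µ, I/λ) comes from the paper.

```python
import numpy as np

def sample_binary_gmm(n_samples, mu, lam, rng):
    """Sample Y = (y, X): y ~ Ber(1/2) and X | y ~ N((2y - 1) * mu, I / lam)."""
    d = mu.shape[0]
    y = rng.integers(0, 2, size=n_samples)       # Ber(1/2) labels in {0, 1}
    signs = (2 * y - 1)[:, None]                 # map labels to {-1, +1}
    noise = rng.standard_normal((n_samples, d)) / np.sqrt(lam)  # covariance I / lam
    return y, signs * mu + noise

rng = np.random.default_rng(0)
d = 250                                          # hypothetical dimension
mu = np.ones(d) / np.sqrt(d)                     # hypothetical unit-norm mean direction
y, X = sample_binary_gmm(1000, mu, lam=4.0, rng=rng)
print(X.shape, y.mean())                         # (1000, 250), label mean near 0.5
```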
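The update rule and critical learning-rate scaling quoted under Experiment Setup translate into a generic online SGD loop. The sketch below uses a toy quadratic loss L(x, Y) = |x − Y|²/2 purely to make the loop runnable; that loss, the sampler, and c_δ = 1 are assumptions, while the one-fresh-sample-per-step structure and the δ_n = c_δ/n scaling follow the quoted text.

```python
import numpy as np

def online_sgd(grad, sample, x0, c_delta, n_steps):
    """Online SGD: X_l = X_{l-1} - delta_n * grad(X_{l-1}, Y_l), with delta_n = c_delta / n.

    Each step draws a fresh sample Y_l, so no data point is ever reused.
    """
    n = x0.shape[0]
    delta = c_delta / n           # critical step-size scaling, order 1/n
    x = x0.copy()
    for _ in range(n_steps):
        x = x - delta * grad(x, sample())
    return x

rng = np.random.default_rng(0)
n = 250
x_star = rng.standard_normal(n)                   # toy target parameter
sample = lambda: x_star + rng.standard_normal(n)  # noisy observations of x_star
grad = lambda x, y: x - y                         # gradient of |x - y|^2 / 2
x_hat = online_sgd(grad, sample, np.zeros(n), c_delta=1.0, n_steps=10 * n)
print(f"distance to target: {np.linalg.norm(x_hat - x_star):.3f}")
```

With δ_n = c_δ/n, the per-step noise contribution is O(1/n), which is what makes n steps of this chain track an ODE/SDE limit of the kind stated in Proposition 4.1.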