SGD vs GD: Rank Deficiency in Linear Networks
Authors: Aditya Vardhan Varre, Margarita Sagitova, Nicolas Flammarion
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this article, we study the behaviour of continuous-time gradient methods on a two-layer linear network with square loss. A dichotomy between SGD and GD is revealed: GD preserves the rank at initialization while (label noise) SGD diminishes the rank regardless of the initialization. We demonstrate this rank deficiency by studying the time evolution of the determinant of a matrix of parameters. To further understand this phenomenon, we derive the stochastic differential equation (SDE) governing the eigenvalues of the parameter matrix. This SDE unveils a repulsive force between the eigenvalues: a key regularization mechanism which induces rank deficiency. Our results are well supported by experiments illustrating the phenomenon beyond linear networks and regression tasks. (A minimal simulation sketch of this GD/SGD contrast appears after the table.) |
| Researcher Affiliation | Academia | Aditya Varre (EPFL), aditya.varre@epfl.ch; Margarita Sagitova (EPFL), margarita.sagitova@epfl.ch; Nicolas Flammarion (EPFL), nicolas.flammarion@epfl.ch |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for its methodology within the main paper or its appendices. (The NeurIPS Paper Checklist, which is metadata about the paper, states code is provided, but this is not part of the research paper content.) |
| Open Datasets | No | We consider a regression problem on synthetic data with n = 1000 samples of Gaussian data in R^5 (p = 5) with labels in R^2 (k = 2) generated by some ground truth β ∈ R^{5×2}; the width of the network is l = 10. We use Gaussian initialization of the network parameters with entries from N(0, 1). |
| Dataset Splits | No | The paper uses synthetic data and discusses its generation but does not specify training, validation, or test dataset splits. |
| Hardware Specification | Yes | The experiments were run on an Intel i5-8250U with 8 GB of RAM, running Ubuntu 20.04.6. |
| Software Dependencies | No | All experiments are implemented with Python 3 [Van Rossum and Drake, 2009] under the PSF license, NumPy [Harris et al., 2020] under the BSD license, and PyTorch [Paszke et al., 2019] under the BSD-3-Clause license. |
| Experiment Setup | Yes | To numerically emulate GF (Figure 1), we set a stepsize of 1e-6 in the numerical simulation. We consider a regression problem on synthetic data with n = 1000 samples of Gaussian data in R^5 (p = 5) with labels in R^2 (k = 2) generated by some ground truth β ∈ R^{5×2}; the width of the network is l = 10. We use Gaussian initialization of the network parameters with entries from N(0, 1). ... As seen in the left plot of Figure 2, when the stepsize is large (η = 0.1), the singular values exhibit behavior similar to the LNGF case, while with the small stepsize (η = 0.005) the evolution of the singular values is closer to the GF case. (An illustrative PyTorch sketch of this stepsize comparison appears after the table.) |
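The GD-versus-label-noise-SGD contrast described in the quoted abstract can be visualized with a short simulation. The following is a minimal NumPy sketch, not the authors' code: it reuses the synthetic setup from the table (n = 1000, p = 5, k = 2, width l = 10, N(0, 1) initialization) with the paper's smaller stepsize of 0.005, but the label-noise scale, the step budget, the helper name `train`, and the choice of tracking the first-layer weights W1 as the parameter matrix are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression setup from the table: n = 1000 Gaussian samples in R^5,
# labels in R^2 from a ground-truth beta, two-layer linear net of width 10.
n, p, k, width = 1000, 5, 2, 10
X = rng.standard_normal((n, p))
beta = rng.standard_normal((p, k))
Y = X @ beta

# Gaussian N(0, 1) initialization, shared by both runs for a clean comparison.
W1_init = rng.standard_normal((p, width))
W2_init = rng.standard_normal((width, k))


def train(label_noise_std, eta=0.005, steps=50_000):
    """Full-batch gradient descent on the square loss of f(x) = x @ W1 @ W2.
    label_noise_std > 0 adds fresh Gaussian noise to the labels at every step
    (label-noise SGD); label_noise_std = 0 recovers plain GD."""
    W1, W2 = W1_init.copy(), W2_init.copy()
    for _ in range(steps):
        Y_t = Y + label_noise_std * rng.standard_normal(Y.shape)
        residual = X @ W1 @ W2 - Y_t        # n x k
        grad_prod = X.T @ residual / n      # gradient w.r.t. the product W1 @ W2
        # Simultaneous update of both layers.
        W1, W2 = W1 - eta * grad_prod @ W2.T, W2 - eta * W1.T @ grad_prod
    # Singular values of the first-layer weights: the paper reports that GD keeps
    # them near their initial magnitudes while label-noise SGD shrinks the surplus ones.
    return np.linalg.svd(W1, compute_uv=False)


print("GD             :", np.round(train(label_noise_std=0.0), 3))
print("label-noise SGD:", np.round(train(label_noise_std=1.0), 3))
```

Per the abstract, the qualitative expectation is that GD keeps all five singular values of W1 away from zero while label-noise SGD drives the surplus ones toward zero; how quickly that shows up in this sketch depends on the chosen noise scale and training horizon.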
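The large-versus-small stepsize comparison in the experiment-setup row can likewise be sketched in PyTorch, the library the paper reports using. The dimensions and the stepsizes η = 0.1 and η = 0.005 come from the table; the mini-batch size, iteration count, helper name `run_sgd`, and the width**-0.5 rescaling of the N(0, 1) initialization (added only to keep the sketch numerically stable at the large stepsize) are assumptions of this sketch, not details from the paper.

```python
import torch

torch.manual_seed(0)

# Same synthetic setup: n = 1000 Gaussian inputs in R^5, labels in R^2, width l = 10.
n, p, k, width = 1000, 5, 2, 10
X = torch.randn(n, p)
beta = torch.randn(p, k)
Y = X @ beta


def run_sgd(eta, steps=20_000, batch_size=50):
    """Mini-batch SGD on the square loss of a two-layer linear network;
    returns the singular values of the first-layer weights after training."""
    # NOTE: the width**-0.5 rescaling below is only to keep this sketch stable
    # at eta = 0.1; the paper states plain N(0, 1) entries at initialization.
    W1 = (torch.randn(p, width) / width**0.5).requires_grad_()
    W2 = (torch.randn(width, k) / width**0.5).requires_grad_()
    opt = torch.optim.SGD([W1, W2], lr=eta)
    for _ in range(steps):
        idx = torch.randint(0, n, (batch_size,))   # sample a mini-batch
        loss = 0.5 * ((X[idx] @ W1 @ W2 - Y[idx]) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.linalg.svdvals(W1.detach())


# Large vs small stepsize, echoing the quoted Figure 2 discussion.
for eta in (0.1, 0.005):
    s = run_sgd(eta)
    print(f"eta = {eta}: singular values of W1 =", [round(v, 3) for v in s.tolist()])
```

This is meant only to echo the qualitative trend discussed around Figure 2 (large stepsize behaving LNGF-like, small stepsize behaving closer to GF), not to reproduce its exact curves.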