Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models
Authors: Courtney Paquette, Elliot Paquette
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We analyze a class of stochastic gradient algorithms with momentum on a high-dimensional random least squares problem. Our framework, inspired by random matrix theory, provides an exact (deterministic) characterization for the sequence of loss values produced by these algorithms which is expressed only in terms of the eigenvalues of the Hessian. This leads to simple expressions for nearly-optimal hyperparameters, a description of the limiting neighborhood, and average-case complexity. As a consequence, we show that (small-batch) stochastic heavy-ball momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. For contrast, in the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum. By introducing hyperparameters that depend on the number of samples, we propose a new algorithm SDANA (stochastic dimension adjusted Nesterov acceleration) which obtains an asymptotically optimal average-case complexity while remaining linearly convergent in the strongly convex setting without adjusting parameters. |
| Researcher Affiliation | Collaboration | Courtney Paquette, Department of Mathematics and Statistics, McGill University, Montreal, Quebec H2Y 2M5, courtney.paquette@mcgill.ca; Elliot Paquette, Department of Mathematics and Statistics, McGill University, Montreal, Quebec H2Y 2M5, elliot.paquette@mcgill.ca [...] C. Paquette has part-time employment at Google Research, Brain Team, Montreal, QC. |
| Pseudocode | Yes | Algorithm 1: Generic stochastic momentum method (a minimal sketch of one such update appears after this table). |
| Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Figure 6: SDANA & SGD vs Theory on MNIST. MNIST (60,000 28×28 images) [LeCun et al., 2010] is reshaped into 10 matrices of dimension 1000 × 4704, representing 1000 samples of groups of 6 digits (preconditioned to have centered rows of norm 1); see the preprocessing sketch after this table. |
| Dataset Splits | No | The paper does not explicitly provide specific training/validation/test dataset splits (e.g., percentages or sample counts) or reference predefined splits with citations for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper does not list specific software components with version numbers required for reproducibility (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x). |
| Experiment Setup | Yes | Table 1: Summary of the parameters for a variety of stochastic momentum algorithms that fit within the framework of Alg. 1; denote the normalized trace by $m \stackrel{\text{def}}{=} n^{-1} \sum_{i=1}^{n} \|a_i\|^2$. The default parameters are chosen so that the linear rate is no slower, by a factor of 4, than the fastest possible rate for an algorithm having optimized over all step size choices. Figure 3: Convergence of SDANA. Halting time of SDANA vs. SGD with default parameters on the Gaussian random least squares problem (2.1) with varying d and n = 1024. |
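
The Pseudocode row above refers to the paper's Algorithm 1, a generic stochastic momentum method. Below is a minimal NumPy sketch of one member of that family: a single-sample heavy-ball update on a Gaussian random least-squares instance. The function name, problem sizes, and step/momentum values are illustrative assumptions, not the paper's exact Algorithm 1 or its Table 1 defaults; the sketch also computes the normalized trace $m$ quoted in the Experiment Setup row.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian random least-squares instance (in the spirit of problem (2.1)):
# minimize f(x) = (1/2n) * ||Ax - b||^2 over x in R^d.
n, d = 1024, 256
A = rng.normal(size=(n, d)) / np.sqrt(d)
b = A @ rng.normal(size=d)  # noiseless targets

# Normalized trace m = n^{-1} * sum_i ||a_i||^2 from Table 1; the paper's
# default step sizes are expressed in terms of this quantity.
m = np.mean(np.sum(A**2, axis=1))

def stochastic_heavy_ball(A, b, step, momentum, iters=5000):
    """Single-sample heavy-ball iteration:
    x_{k+1} = x_k - step * g_k + momentum * (x_k - x_{k-1})."""
    n, d = A.shape
    x = x_prev = np.zeros(d)
    losses = []
    for _ in range(iters):
        i = rng.integers(n)               # sample one row a_i uniformly
        g = (A[i] @ x - b[i]) * A[i]      # gradient of (1/2)(a_i^T x - b_i)^2
        x, x_prev = x - step * g + momentum * (x - x_prev), x
        losses.append(np.mean((A @ x - b) ** 2) / 2)
    return x, losses

# Illustrative hyperparameters, not the paper's tuned defaults.
x_hat, losses = stochastic_heavy_ball(A, b, step=0.05 / m, momentum=0.9)
print(f"loss after {len(losses)} iterations: {losses[-1]:.3e}")
```

Note that this sketches only the fixed-momentum heavy-ball case; SDANA's distinguishing feature per the abstract, hyperparameters that depend on the number of samples, is not reproduced here.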
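The Open Datasets row quotes the MNIST preparation used in Figure 6. Here is a minimal sketch of that reshaping, assuming the images arrive as a (60000, 28, 28) array and that six consecutive digits are grouped per row; the grouping order, the function name, and the data-loading step are assumptions.

```python
import numpy as np

def mnist_to_blocks(images):
    """Reshape (60000, 28, 28) MNIST images into 10 matrices of shape
    (1000, 4704): each row concatenates 6 flattened digits (6 * 784 = 4704),
    then rows are centered and rescaled to unit norm."""
    X = images.reshape(60000, 28 * 28).astype(np.float32)  # flatten digits
    X = X.reshape(10000, 6 * 784)              # 6 consecutive digits per row
    X -= X.mean(axis=1, keepdims=True)         # center each row
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # rescale rows to norm 1
    return X.reshape(10, 1000, 4704)           # 10 blocks of 1000 samples

# Stand-in random data; loading real MNIST is left to the reader.
blocks = mnist_to_blocks(np.random.default_rng(0).random((60000, 28, 28)))
print(blocks.shape)  # (10, 1000, 4704)
```

The arithmetic matches the quoted description: 6 × 784 = 4704 columns per row, and 60,000 / 6 = 10,000 rows split into ten 1000-row blocks.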