Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

Authors: Courtney Paquette, Elliot Paquette

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We analyze a class of stochastic gradient algorithms with momentum on a high-dimensional random least squares problem. Our framework, inspired by random matrix theory, provides an exact (deterministic) characterization of the sequence of loss values produced by these algorithms, expressed only in terms of the eigenvalues of the Hessian. This leads to simple expressions for nearly optimal hyperparameters, a description of the limiting neighborhood, and average-case complexity. As a consequence, we show that (small-batch) stochastic heavy-ball momentum with a fixed momentum parameter provides no actual performance improvement over SGD when step sizes are adjusted correctly. In contrast, in the non-strongly convex setting, it is possible to get a large improvement over SGD using momentum. By introducing hyperparameters that depend on the number of samples, we propose a new algorithm, SDANA (stochastic dimension-adjusted Nesterov acceleration), which obtains an asymptotically optimal average-case complexity while remaining linearly convergent in the strongly convex setting without adjusting parameters. (A hedged code sketch of the heavy-ball-vs-SGD comparison follows the table.)
Researcher Affiliation | Collaboration | Courtney Paquette, Department of Mathematics and Statistics, McGill University, Montreal, Quebec H2Y 2M5, courtney.paquette@mcgill.ca; Elliot Paquette, Department of Mathematics and Statistics, McGill University, Montreal, Quebec H2Y 2M5, elliot.paquette@mcgill.ca [...] C. Paquette has part-time employment at Google Research, Brain Team, Montreal, QC.
Pseudocode | Yes | Algorithm 1: Generic stochastic momentum method. (A hedged sketch of one instance of such a method follows the table.)
Open Source Code | No | The paper does not contain an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Figure 6: SDANA & SGD vs. Theory on MNIST. MNIST (60,000 28×28 images) [LeCun et al., 2010] is reshaped into 10 matrices of dimension 1000 × 4704, representing 1000 samples of groups of 6 digits (preconditioned to have centered rows of norm 1). (A sketch of this preprocessing follows the table.)
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits (e.g., percentages or sample counts) or reference predefined splits with citations for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper does not list specific software components with version numbers required for reproducibility (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup | Yes | Table 1: Summary of the parameters for a variety of stochastic momentum algorithms that fit within the framework of Alg. 1; denote the normalized trace by $m \stackrel{\text{def}}{=} \frac{1}{n}\sum_{i=1}^{n} \|a_i\|^2$. The default parameters are chosen so that the linear rate is no more than a factor of 4 slower than the fastest possible rate for an algorithm having optimized over all step-size choices. Figure 3: Convergence of SDANA. Halting time of SDANA vs. SGD with default parameters on the Gaussian random least squares problem (2.1) with varying d and n = 1024. (A sketch generating such an instance and computing m follows the table.)
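
To make the heavy-ball-vs-SGD claim from the abstract concrete, here is a minimal sketch, assuming a Gaussian random least-squares objective f(x) = (1/2n)‖Ax − b‖² and single-sample gradients; the problem sizes, step sizes, and momentum value below are illustrative, not the paper's tuned choices. Setting delta = 0 recovers plain SGD, which is the comparison the abstract describes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 512                                 # illustrative sizes
A = rng.standard_normal((n, d)) / np.sqrt(d)     # Gaussian random data
b = A @ rng.standard_normal(d)                   # planted least-squares targets

def loss(x):
    """Full objective f(x) = (1/2n) * ||Ax - b||^2."""
    r = A @ x - b
    return 0.5 * (r @ r) / n

def stochastic_heavy_ball(steps, gamma, delta):
    """x_{k+1} = x_k - gamma * grad f_{i_k}(x_k) + delta * (x_k - x_{k-1})."""
    x_prev = x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                      # single-sample (small-batch) gradient
        g = (A[i] @ x - b[i]) * A[i]
        x, x_prev = x - gamma * g + delta * (x - x_prev), x
    return x

x_sgd = stochastic_heavy_ball(20000, gamma=0.5, delta=0.0)   # delta = 0 -> plain SGD
x_shb = stochastic_heavy_ball(20000, gamma=0.25, delta=0.5)  # fixed momentum
print(loss(x_sgd), loss(x_shb))
```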
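
The paper's Algorithm 1 is a generic template covering SGD, stochastic heavy ball, stochastic Nesterov acceleration, and SDANA as particular parameter choices; its exact parameterization is given in the paper. As a hedged illustration of another instance of such a template, here is a standard stochastic Nesterov-style update, written as a self-contained function that can be run on the (A, b) instance from the previous sketch.

```python
import numpy as np

def stochastic_nesterov(A, b, steps, gamma, delta, seed=1):
    """Stochastic Nesterov-style iteration, one instance of a generic
    stochastic momentum loop (not the paper's Algorithm 1 verbatim):
        y_k     = x_k + delta * (x_k - x_{k-1})    # extrapolation
        x_{k+1} = y_k - gamma * grad f_{i_k}(y_k)  # stochastic step at y_k
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_prev = x = np.zeros(d)
    for _ in range(steps):
        y = x + delta * (x - x_prev)
        i = rng.integers(n)                        # single-sample gradient
        g = (A[i] @ y - b[i]) * A[i]
        x, x_prev = y - gamma * g, x
    return x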
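
The reshaping described in the Figure 6 caption is mechanical, so a sketch may help. It assumes MNIST is already loaded as a (60000, 28, 28) array via any standard loader; the centering and normalization below is one reading of "centered rows of norm 1", not code from the paper.

```python
import numpy as np

def mnist_to_blocks(images):
    """Reshape MNIST (60000, 28, 28) into 10 matrices of shape (1000, 4704):
    1000 rows per block, each row the concatenation of 6 flattened digits
    (6 * 28 * 28 = 4704), with rows preconditioned to be centered and norm 1.
    """
    assert images.shape == (60000, 28, 28)
    X = images.reshape(10, 1000, 6 * 28 * 28).astype(np.float64)
    X -= X.mean(axis=2, keepdims=True)                 # center each row
    X /= np.linalg.norm(X, axis=2, keepdims=True)      # scale rows to norm 1
    return X
```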
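
Since Table 1 states its default parameters in terms of the normalized trace m = n⁻¹ Σᵢ ‖aᵢ‖², this last sketch generates a Gaussian instance in the spirit of problem (2.1) with n = 1024 and computes m; the paper's exact scaling and noise model may differ from the assumptions made here.

```python
import numpy as np

def gaussian_least_squares(n=1024, d=512, seed=0):
    """Gaussian random least-squares instance in the spirit of problem (2.1);
    the paper's exact normalization/noise model may differ from this sketch."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d)) / np.sqrt(d)   # i.i.d. N(0, 1/d) entries
    b = A @ rng.standard_normal(d)                 # planted signal, no noise
    return A, b

A, b = gaussian_least_squares()
# Normalized trace from Table 1: m = (1/n) * sum_i ||a_i||^2 = tr(A A^T) / n.
m = (A * A).sum() / A.shape[0]
print(m)   # roughly 1 under this N(0, 1/d) scaling
```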