Understanding the Role of Momentum in Stochastic Gradient Methods

Authors: Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao

NeurIPS 2019

Reproducibility assessment (each entry lists the variable, the extracted result, and the LLM's supporting response):
Research Type: Experimental. "In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters. ... We evaluate the average final loss for a large grid of parameters α, β and ν on three problems: a 2-dimensional quadratic function (where all of our assumptions are satisfied), logistic regression on the MNIST [16] dataset (where the quadratic assumption is approximately satisfied, but gradient noise comes from mini-batches) and ResNet-18 [10] on CIFAR-10 [13] (where all of our assumptions are likely violated). Figure 3 shows the results of this experiment."
Researcher Affiliation: Industry. Igor Gitman, Hunter Lang, Pengchuan Zhang, Lin Xiao; Microsoft Research AI, Redmond, WA 98052, USA. {igor.gitman, hunter.lang, penzhan, lin.xiao}@microsoft.com
Pseudocode: No. The paper describes the QHM algorithm with its mathematical update, equation (6), and discusses its dynamics, but it does not present the method in a pseudocode block or a clearly labeled algorithm section.
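
For reference, a minimal sketch of the quasi-hyperbolic momentum (QHM) update in the standard form (Ma and Yarats, 2019), using the same parameter names α, β, ν as the quoted excerpts; this is an illustration, not the authors' code:

```python
import numpy as np

def qhm_step(x, buf, grad, alpha, beta, nu):
    """One QHM update (learning rate alpha, momentum beta, averaging weight nu).

    buf <- beta * buf + (1 - beta) * grad          # exponential moving average
    x   <- x - alpha * ((1 - nu) * grad + nu * buf)

    nu = 0 recovers plain SGD; nu = 1 recovers SGD with (normalized) momentum.
    """
    buf = beta * buf + (1 - beta) * grad
    x = x - alpha * ((1 - nu) * grad + nu * buf)
    return x, buf
```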
Open Source Code: Yes. "The code of all of our experiments is available at https://github.com/Kipok/understanding-momentum."
Open Datasets: Yes. "Next, we evaluate the average final loss for a large grid of parameters α, β and ν on three problems: a 2-dimensional quadratic function (where all of our assumptions are satisfied), logistic regression on the MNIST [16] dataset (where the quadratic assumption is approximately satisfied, but gradient noise comes from mini-batches) and ResNet-18 [10] on CIFAR-10 [13] (where all of our assumptions are likely violated)."
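
Both datasets are standard public benchmarks. The paper does not describe its data pipeline (the linked repository has the authors' actual code); a minimal, hypothetical sketch of obtaining the two datasets via torchvision, with illustrative paths and transforms:

```python
import torchvision
import torchvision.transforms as T

# Download the standard public copies of the two datasets used in the paper.
# Root directory and transform here are assumptions, not the authors' setup.
mnist = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
cifar10 = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor()
)
```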
Dataset Splits: No. The paper uses MNIST and CIFAR-10, which have standard predefined splits, but it does not explicitly state training, validation, or test split percentages or sample counts, nor does it cite a source for the specific splits used.
Hardware Specification: No. The paper describes experiments and their outcomes but does not provide any specific details about the hardware used (e.g., GPU models, CPU types, or memory specifications).
Software Dependencies: No. The paper describes the algorithms and experiments but does not provide version numbers for any software dependencies, such as the programming language, libraries, or frameworks used (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup: Yes. "In Section 6, by combining our results in Sections 4 and 5, we obtain new and, in some cases, counter-intuitive insight into how to set these parameters in practice. ... Figure 2: Changes in the shape and size of the stationary distribution with respect to α, β, and ν on a 2-dimensional quadratic problem. Each picture shows the last 5000 iterates of QHM on a contour plot. The first picture of each row is a reference and the other pictures should be compared to it. The second pictures show how the stationary distribution changes when we decrease α."
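
To make the described setup concrete, here is a small self-contained sketch in the spirit of Figure 2: QHM run on a noisy 2-dimensional quadratic, keeping the last 5000 iterates as a sample from the stationary distribution. The matrix A, noise scale, and parameter values are illustrative assumptions, not the authors' exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# A 2-dimensional quadratic f(x) = 0.5 * x^T A x with additive gradient noise.
# A and noise_std are illustrative choices, not values from the paper.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
noise_std = 0.5

def noisy_grad(x):
    return A @ x + noise_std * rng.standard_normal(2)

def run_qhm(alpha, beta, nu, steps=20_000, keep_last=5_000):
    x = np.array([3.0, 3.0])
    buf = np.zeros(2)
    trace = []
    for t in range(steps):
        g = noisy_grad(x)
        buf = beta * buf + (1 - beta) * g          # momentum buffer
        x = x - alpha * ((1 - nu) * g + nu * buf)  # QHM step
        if t >= steps - keep_last:
            trace.append(x.copy())
    return np.array(trace)

# Last 5000 iterates approximate the stationary distribution for one (α, β, ν);
# sweeping these parameters over a grid gives the kind of comparison in the paper.
iterates = run_qhm(alpha=0.1, beta=0.9, nu=0.7)
avg_final_loss = 0.5 * np.mean(np.einsum("ti,ij,tj->t", iterates, A, iterates))
print(f"average final loss: {avg_final_loss:.4f}")
```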