Escaping Saddle Points with Adaptive Gradient Methods

Authors: Matthew Staib, Sashank Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 6. Experiments. We experimentally test our claims about adaptive methods escaping saddle points, and our suggestion for setting β. Escaping saddle points. First, we test our claim that when the gradient noise is ill-conditioned, adaptive methods escape saddle points faster than SGD, and often converge faster to (approximate) local minima. We construct a two-dimensional non-convex problem f(x) = (1/n) Σ_{i=1}^n f_i(x), where f_i(x) = (1/2) x^T H x + b_i^T x + ||x||_10^10. Here, H = diag([1, -0.1]), so f has a saddle point at the origin with objective value zero. The vectors b_i are chosen so that sampling b uniformly from {b_i}_{i=1}^n yields E[b] = 0 and Cov(b) = diag([1, 0.01]). Hence at the origin there is an escape direction but little gradient noise in that direction. We initialize SGD and (diagonal) RMSProp (with β = 1 - η^(2/3)) at the saddle point and test several stepsizes η for each. Results for the first 10^4 iterations are shown in Figure 1. In order to escape the saddle point as fast as RMSProp, SGD requires a substantially larger stepsize, e.g. SGD needs η = 0.01 to escape as fast as RMSProp does with η = 0.001. But with such a large stepsize, SGD cannot converge to a small neighborhood of the local minimum, and instead bounces around due to gradient noise. Since RMSProp can escape with a small stepsize, it can converge to a much smaller neighborhood of the local minimum. Overall, for any fixed final convergence criterion, RMSProp escapes faster and converges faster overall. Setting the EMA parameter β. Next, we test our recommendations regarding setting the EMA parameter β. We consider logistic regression on MNIST. We use (diagonal) RMSProp with batch size 100, decreasing stepsize η_t = 0.001/√t and ε = 10^-8, and compare different schedules for β. Specifically we test β ∈ {0.7, 0.9, 0.97, 0.99} (so that 1 - β is spaced roughly logarithmically) as well as our recommendation of β_t = 1 - C·η_t^(2/3) for C ∈ {0.1, 0.3, 1}. As shown in Figure 2, all options for β have similar performance initially, but as η_t decreases, large β yields substantially better performance. In particular, our decreasing β schedule achieved the best performance, and moreover was insensitive to how C was set. (An illustrative sketch of the saddle-point construction appears after this table.)
Researcher Affiliation | Collaboration | Matthew Staib (MIT EECS; based on work performed at Google Research New York), Sashank Reddi (Google Research New York), Satyen Kale (Google Research New York), Sanjiv Kumar (Google Research New York), Suvrit Sra (MIT EECS). Correspondence to: Matthew Staib <mstaib@mit.edu>.
Pseudocode | Yes | Algorithm 1: Preconditioned SGD; Algorithm 2: Full-matrix RMSProp; Algorithm 3: RMSProp with burn-in; Algorithm 4: Preconditioned SGD with increasing stepsize
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | Setting the EMA parameter β. Next, we test our recommendations regarding setting the EMA parameter β. We consider logistic regression on MNIST.
Dataset Splits | No | The paper mentions using MNIST but does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment.
Experiment Setup | Yes | We use (diagonal) RMSProp with batch size 100, decreasing stepsize η_t = 0.001/√t and ε = 10^-8, and compare different schedules for β. Specifically we test β ∈ {0.7, 0.9, 0.97, 0.99} (so that 1 - β is spaced roughly logarithmically) as well as our recommendation of β_t = 1 - C·η_t^(2/3) for C ∈ {0.1, 0.3, 1}. (A second sketch after this table illustrates the β schedule.)
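
Illustrative sketch of the saddle-point experiment quoted in the Research Type row. This is a hedged reconstruction under stated assumptions, not the authors' code: the Gaussian model for the noise vector b, the random seed, and the printed diagnostics are assumptions made for illustration, while the objective, the ill-conditioned noise covariance, the initialization at the saddle point, and the schedule β = 1 - η^(2/3) follow the excerpt.

# Hedged sketch (not the authors' code) of the saddle-point experiment:
# a 2-D objective with a saddle at the origin and ill-conditioned gradient
# noise, optimized by SGD and diagonal RMSProp with beta = 1 - eta^(2/3).
# The Gaussian noise model and the printed output are illustrative assumptions.
import numpy as np

H = np.diag([1.0, -0.1])          # one negative-curvature (escape) direction
NOISE_COV = np.diag([1.0, 0.01])  # little gradient noise along the escape direction

def stochastic_grad(x, rng):
    b = rng.multivariate_normal(np.zeros(2), NOISE_COV)    # stand-in for sampling b_i
    return H @ x + b + 10.0 * np.sign(x) * np.abs(x) ** 9  # grad of 0.5 x^T H x + b^T x + ||x||_10^10

def run(optimizer, eta, steps=10_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(2)                  # start exactly at the saddle point
    v = np.zeros(2)                  # RMSProp second-moment accumulator
    beta = 1.0 - eta ** (2.0 / 3.0)  # EMA parameter from the excerpt
    eps = 1e-8
    for _ in range(steps):
        g = stochastic_grad(x, rng)
        if optimizer == "sgd":
            x = x - eta * g
        else:  # diagonal RMSProp
            v = beta * v + (1.0 - beta) * g ** 2
            x = x - eta * g / (np.sqrt(v) + eps)
    return x

for eta in (0.01, 0.001):
    print(eta, run("sgd", eta), run("rmsprop", eta))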
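
Illustrative sketch of the β schedule from the Experiment Setup row. Again a hedged sketch rather than the authors' implementation: the MNIST logistic-regression model and data pipeline are omitted and replaced by a hypothetical grad_fn stand-in; only the schedules η_t = 0.001/√t and β_t = 1 - C·η_t^(2/3) and the diagonal RMSProp update follow the excerpt.

# Hedged sketch (assumptions, not the authors' code) of the recommended beta
# schedule: diagonal RMSProp with decreasing stepsize eta_t = 0.001 / sqrt(t)
# and beta_t = 1 - C * eta_t^(2/3). `grad_fn` is a hypothetical stand-in for a
# minibatch gradient (batch size 100 in the paper); MNIST is not loaded here.
import numpy as np

def rmsprop_schedule_step(x, v, t, grad_fn, C=0.3, eps=1e-8):
    eta_t = 0.001 / np.sqrt(t)                # decreasing stepsize from the excerpt
    beta_t = 1.0 - C * eta_t ** (2.0 / 3.0)   # recommended EMA schedule
    g = grad_fn(x)                            # minibatch gradient
    v = beta_t * v + (1.0 - beta_t) * g ** 2  # diagonal second-moment EMA
    x = x - eta_t * g / (np.sqrt(v) + eps)
    return x, v

# Illustration on a noisy toy quadratic instead of MNIST logistic regression:
rng = np.random.default_rng(0)
grad_fn = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x, v = np.ones(10), np.zeros(10)
for t in range(1, 1001):
    x, v = rmsprop_schedule_step(x, v, t, grad_fn)
print(np.linalg.norm(x))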