Directional Smoothness and Gradient Methods: Convergence and Adaptivity

Authors: Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert Gower

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. ... Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness.
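As background for the quoted abstract, the display below sketches the general shape of a directional-smoothness bound: a path-dependent quantity D(x, y) replaces the global constant L in the descent inequality. This is an illustration of the idea under standard assumptions, not a quotation of the paper's definitions or theorems.

    % Illustrative only: D(x, y) controls gradient variation between x and y.
    \[
      f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{D(x, y)}{2} \|y - x\|^2 ,
    \]
    % so one gradient step x_{k+1} = x_k - \eta_k \nabla f(x_k) gives
    \[
      f(x_{k+1}) \le f(x_k) - \eta_k \Bigl(1 - \frac{\eta_k D(x_k, x_{k+1})}{2}\Bigr) \|\nabla f(x_k)\|^2 .
    \]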
Researcher Affiliation | Collaboration | Aaron Mishkin, Stanford University, amishkin@cs.stanford.edu; Ahmed Khaled, Princeton University, ahmed.khaled@princeton.edu; Yuanhao Wang, Princeton University, yuanhaoa@princeton.edu; Aaron Defazio, FAIR, Meta AI, adefazio@meta.com; Robert M. Gower, CCM, Flatiron Institute, gowerrobert@gmail.com
Pseudocode | Yes | Algorithm 1: Gradient Descent with Exponential Search
    Procedure ExponentialSearch(x, η0)
        for k = 1, 2, 3, . . . do
            ηout ← RootFindingBisection(x, 2^(-2^k) η0, η0)
            if ηout < ∞ then
                Return ηout
            end if
        end for
    End Procedure
    Procedure RootFindingBisection(x, ηlo, ηhi)
        Define ϕ(η) = η − ψ(η), where ψ(η) is given in (24)    \ One access to ϕ requires T descent steps.
        if ϕ(ηhi) ≤ 0 then
            Return ηhi
        end if
        if ϕ(ηlo) > 0 then
            Return ∞
        end if
        while ηhi > 2 ηlo do
            ηmid = √(ηlo · ηhi)
            if ϕ(ηmid) > 0 then
                ηhi = ηmid
            else
                ηlo = ηmid
            end if    \ Invariant: ϕ(ηhi) > 0, and ϕ(ηlo) ≤ 0.
        end while
        Return ηlo
    End Procedure
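To make the quoted pseudocode concrete, here is a minimal Python sketch of the two procedures. It is not the authors' released code: the callable phi stands in for ϕ(η) = η − ψ(η) from Equation (24), whose evaluation (T descent steps) is not reproduced here, and the max_rounds cap is an added safeguard that the pseudocode does not have.

    import math

    def root_finding_bisection(phi, eta_lo, eta_hi):
        """Bisection in log-space for a step-size, following Algorithm 1.

        phi(eta) stands for eta - psi(eta); one evaluation costs T descent steps.
        Returns a step-size, or math.inf if the bracket [eta_lo, eta_hi] fails.
        """
        if phi(eta_hi) <= 0:
            return eta_hi
        if phi(eta_lo) > 0:
            return math.inf
        # Invariant: phi(eta_hi) > 0 and phi(eta_lo) <= 0.
        while eta_hi > 2 * eta_lo:
            eta_mid = math.sqrt(eta_lo * eta_hi)  # geometric mean: bisect in log-space
            if phi(eta_mid) > 0:
                eta_hi = eta_mid
            else:
                eta_lo = eta_mid
        return eta_lo

    def exponential_search(phi, eta0, max_rounds=10):
        """Shrink the lower end of the bracket doubly exponentially until bisection succeeds."""
        for k in range(1, max_rounds + 1):
            eta_out = root_finding_bisection(phi, 2.0 ** -(2 ** k) * eta0, eta0)
            if eta_out < math.inf:
                return eta_out
        raise RuntimeError("no valid step-size found within max_rounds")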
Open Source Code | No | Answer: [No] Justification: We will release the code to reproduce our experiments upon acceptance of the paper. All non-synthetic data used in this paper is open source and freely accessible online.
Open Datasets | Yes | We evaluate the practical improvement of our convergence rates over those using L-smoothness on two logistic regression problems taken from the UCI repository (Asuncion and Newman, 2007). ... For the UCI datasets, we use the pre-processed version of the data provided by Fernández-Delgado et al. (2014).
Dataset Splits | Yes | Instead, we randomly perform an 80/20 train-test split and use the test set for validation.
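A minimal sketch of the split described in the excerpt. The excerpt does not name the tooling or the random seed, so scikit-learn's train_test_split and the seed below are assumptions, and the random features merely stand in for a pre-processed UCI dataset.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Placeholder features/labels standing in for a pre-processed UCI dataset.
    X = np.random.randn(1000, 20)
    y = np.random.randint(0, 2, size=1000)

    # 80/20 train-test split; the held-out 20% serves as the validation set.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)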
Hardware Specification | Yes | The experiments in Figure 3 were run on a MacBook Pro (16-inch, 2019) with a 2.6 GHz 6-core Intel i7 CPU and 16 GB of memory. All other experiments were run on a Slurm cluster with several different node configurations. Our experiments on the cluster were run with nodes using (i) Nvidia A100 GPUs (80 GB or 40 GB memory) or Nvidia H100-80GB GPUs with Icelake CPUs, or (ii) Nvidia V100-32GB or V100-16GB GPUs with Skylake CPUs. All jobs were allocated a single GPU and 24 GB of RAM.
Software Dependencies | No | We run our logistic regression experiments using PyTorch (Paszke et al., 2019). ... In order to compute the strongly adapted step-sizes, we run the SciPy (Virtanen et al., 2020) implementation of Newton's method on Equation (44). ... We run these experiments using vanilla NumPy.
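The excerpt states that the strongly adapted step-sizes are computed by running SciPy's Newton routine on Equation (44), which is not reproduced in this report. The sketch below therefore only illustrates the call pattern: the residual g is a made-up placeholder for Equation (44), and the initial guess is an assumption.

    from scipy.optimize import newton

    # Placeholder residual standing in for Equation (44) of the paper:
    # the strongly adapted step-size is found as a root of some g(eta) = 0.
    # This particular g is purely illustrative.
    def g(eta, grad_norm=1.0, smoothness=10.0):
        return eta - 1.0 / (smoothness * (1.0 + eta * grad_norm))

    eta_star = newton(g, x0=1e-3)  # x0 is an assumed initial guess
    print(f"strongly adapted step-size (illustrative): {eta_star:.6f}")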
Experiment Setup | Yes | Unless otherwise stated, all methods are initialized using the Kaiming initialization (He et al., 2015), which is standard in PyTorch. ... For normalized GD, we use the step-size schedule ηk = η0/k as suggested by our theory. To pick η0, we run a grid search on the grid generated by np.logspace(-8, 1, 20). ... We implement AdGD from scratch and use a starting step-size of η0 = 10^-3. ... We generate a quadratic optimization problem min_x (1/2) x^T A x − b^T x, where the eigenvalues of A were generated to follow a power-law distribution with parameter α = 3. We scaled the eigenvalues to ensure L = 1000. The dimension of the problem we create is d = 300.
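The synthetic quadratic problem is specified concretely enough to sketch. The construction below follows the excerpt (d = 300, a power-law spectrum with α = 3 rescaled so that L = 1000, and the np.logspace(-8, 1, 20) step-size grid); the deterministic k^(-α) spectrum, the random orthogonal basis, the choice of b, and the seed are assumptions the excerpt does not pin down.

    import numpy as np

    rng = np.random.default_rng(0)  # seed is an assumption
    d, L, alpha = 300, 1000.0, 3.0

    # Power-law eigenvalues, rescaled so the largest (the smoothness constant) equals L.
    eigs = (1.0 + np.arange(d)) ** (-alpha)
    eigs = eigs / eigs.max() * L

    # Random orthogonal basis; the excerpt only fixes the spectrum, not the basis.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    A = Q @ np.diag(eigs) @ Q.T
    b = rng.standard_normal(d)

    def f(x):
        return 0.5 * x @ A @ x - b @ x

    def grad(x):
        return A @ x - b

    # Step-size grid used for the grid search described in the excerpt.
    step_sizes = np.logspace(-8, 1, 20)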