Directional Smoothness and Gradient Methods: Convergence and Adaptivity
Authors: Aaron Mishkin, Ahmed Khaled, Yuanhao Wang, Aaron Defazio, Robert Gower
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop new sub-optimality bounds for gradient descent (GD) that depend on the conditioning of the objective along the path of optimization rather than on global, worst-case constants. Key to our proofs is directional smoothness, a measure of gradient variation that we use to develop upper-bounds on the objective. ... Experiments on logistic regression show our convergence guarantees are tighter than the classical theory based on L-smoothness. |
| Researcher Affiliation | Collaboration | Aaron Mishkin, Stanford University, amishkin@cs.stanford.edu; Ahmed Khaled, Princeton University, ahmed.khaled@princeton.edu; Yuanhao Wang, Princeton University, yuanhaoa@princeton.edu; Aaron Defazio, FAIR, Meta AI, adefazio@meta.com; Robert M. Gower, CCM, Flatiron Institute, gowerrobert@gmail.com |
| Pseudocode | Yes | Algorithm 1 Gradient Descent with Exponential Search. 1: Procedure ExponentialSearch(x, η0) 2: for k = 1, 2, 3, ... do 3: η_out ← RootFindingBisection(x, 2^(-2^k) η0, η0) 4: if η_out < ∞ then 5: Return η_out 6: end if 7: end for 8: End Procedure 9: Procedure RootFindingBisection(x, η_lo, η_hi) 10: Define ϕ(η) = η − ψ(η), where ψ(η) is given in (24) ▷ One access to ϕ requires T descent steps. 11: if ϕ(η_hi) ≤ 0 then 12: Return η_hi 13: end if 14: if ϕ(η_lo) > 0 then 15: Return ∞ 16: end if 17: while η_hi > 2 η_lo do 18: η_mid = √(η_lo · η_hi) 19: if ϕ(η_mid) > 0 then 20: η_hi = η_mid 21: else 22: η_lo = η_mid 23: end if ▷ Invariant: ϕ(η_hi) > 0 and ϕ(η_lo) ≤ 0. 24: end while 25: Return η_lo 26: End Procedure (A hedged Python sketch of this procedure appears below the table.) |
| Open Source Code | No | Answer: [No] Justification: We will release the code to reproduce our experiments upon acceptance of the paper. All non-synthetic data used in this paper is open source and freely accessible online. |
| Open Datasets | Yes | We evaluate the practical improvement of our convergence rates over those using L-smoothness on two logistic regression problems taken from the UCI repository (Asuncion and Newman, 2007). ... For the UCI datasets, we use the pre-processed version of the data provided by Fernández-Delgado et al. (2014) |
| Dataset Splits | Yes | Instead, we randomly perform an 80/20 train-test split and use the test set for validation. |
| Hardware Specification | Yes | The experiments in Figure 3 were run on a MacBook Pro (16 inch, 2019) with a 2.6 GHz 6-Core Intel i7 CPU and 16GB of memory. All other experiments were run on a Slurm cluster with several different node configurations. Our experiments on the cluster were run with nodes using (i) Nvidia A100 GPUs (80GB or 40GB memory) or Nvidia H100-80GB GPUs with Icelake CPUs, or (ii) Nvidia V100-32GB or V100-16GB GPUs with Skylake CPUs. All jobs were allocated a single GPU and 24GB of RAM. |
| Software Dependencies | No | We run our logistic regression experiments using PyTorch (Paszke et al., 2019). ... In order to compute the strongly adapted step-sizes, we run the SciPy (Virtanen et al., 2020) implementation of Newton's method on Equation (44). ... We run these experiments using vanilla NumPy. (A hedged sketch of the SciPy call pattern appears below the table.) |
| Experiment Setup | Yes | Unless otherwise stated, all methods are initialized using the Kaiming initialization (He et al., 2015), which is standard in PyTorch. ... For normalized GD, we use the step-size schedule ηk = η0/k as suggested by our theory. To pick η0, we run a grid search on the grid generated by np.logspace(-8, 1, 20). ... We implement AdGD from scratch and use a starting step-size of η0 = 10^-3. ... We generate a quadratic optimization problem min_x (1/2) x^T A x − b^T x, where the eigenvalues of A were generated to follow a power-law distribution with parameter α = 3. We scaled the eigenvalues to ensure L = 1000. The dimension of the problem we create is d = 300. (A hedged NumPy sketch of this setup appears below the table.) |
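
The pseudocode quoted in the table is a text extraction of Algorithm 1; the Python sketch below reconstructs the same exponential-search-over-bisection structure for illustration only. It is not the authors' released code: the callable `phi` stands in for ϕ(η) = η − ψ(η) from the paper's Equation (24) (each evaluation costs T descent steps), and the doubly exponential bracket shrinkage 2^(-2^k)·η0 follows the reconstruction of line 3 above.

```python
import math

def root_finding_bisection(phi, eta_lo, eta_hi):
    """Bisect in log-scale for the largest bracketed eta with phi(eta) <= 0."""
    if phi(eta_hi) <= 0:
        return eta_hi        # the largest candidate already satisfies the condition
    if phi(eta_lo) > 0:
        return math.inf      # invalid bracket: signal failure to the outer search
    while eta_hi > 2 * eta_lo:
        eta_mid = math.sqrt(eta_lo * eta_hi)   # geometric midpoint
        if phi(eta_mid) > 0:
            eta_hi = eta_mid
        else:
            eta_lo = eta_mid
        # invariant: phi(eta_hi) > 0 and phi(eta_lo) <= 0
    return eta_lo

def exponential_search(phi, eta0):
    """Shrink the lower bracket endpoint doubly exponentially until bisection succeeds.

    Assumes some eta in (0, eta0] satisfies phi(eta) <= 0; otherwise the search
    does not terminate, matching the pseudocode.
    """
    k = 1
    while True:
        eta_out = root_finding_bisection(phi, (2.0 ** -(2 ** k)) * eta0, eta0)
        if eta_out < math.inf:
            return eta_out
        k += 1
```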
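
The dependencies row also mentions computing the strongly adapted step-sizes with SciPy's Newton solver applied to the paper's Equation (44). That equation is not reproduced here, so the snippet below only illustrates the call pattern; the residual `g` is a hypothetical placeholder, not the paper's actual condition.

```python
from scipy.optimize import newton

def g(eta):
    # Hypothetical scalar residual; replace with the condition from Equation (44)
    # to recover the strongly adapted step-size.
    return eta - 1.0 / (1.0 + eta)

# Without an explicit derivative, scipy.optimize.newton runs the secant method
# from the initial guess x0.
eta_star = newton(g, x0=1e-3)
```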
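
Finally, the synthetic quadratic in the setup row can be reproduced approximately with NumPy. The sketch below combines two quoted details (the schedule ηk = η0/k for normalized GD and the grid np.logspace(-8, 1, 20)) with several assumptions of ours: the eigenvalue law λ_i ∝ i^(-α), the random orthogonal basis, the Gaussian right-hand side b, the seed, and the iteration budget are illustrative choices, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)          # assumed seed
d, L, alpha = 300, 1000.0, 3.0          # dimension, target smoothness, power-law parameter

# Quadratic min_x 0.5 * x^T A x - b^T x with power-law eigenvalues scaled so L = 1000.
eigs = np.arange(1, d + 1, dtype=float) ** -alpha
eigs *= L / eigs.max()                  # largest eigenvalue becomes L
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(eigs) @ Q.T
b = rng.standard_normal(d)

def final_loss(eta0, iters=100):
    """Normalized GD with the schedule eta_k = eta0 / k; returns the final objective."""
    x = np.zeros(d)
    for k in range(1, iters + 1):
        grad = A @ x - b
        x -= (eta0 / k) * grad / np.linalg.norm(grad)
    return 0.5 * x @ A @ x - b @ x

# Grid search for the initial step-size, as described in the setup row.
grid = np.logspace(-8, 1, 20)
best_eta0 = min(grid, key=final_loss)
```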