“Convex Until Proven Guilty”: Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

Authors: Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Preliminary experiments: The primary purpose of this paper is to demonstrate the feasibility of acceleration for non-convex problems using only first-order information. Given the long history of development of careful schemes for non-linear optimization, it is unrealistic to expect a simple implementation of the momentum-based Algorithm 3 to outperform state-of-the-art methods such as non-linear conjugate gradients and L-BFGS. It is important, however, to understand the degree of non-convexity in problems we encounter in practice, and to investigate the efficacy of the negative curvature detection-and-exploitation scheme we propose. Toward this end, we present two experiments: (1) fitting a non-linear regression model and (2) training a small neural network. In these experiments we compare a basic implementation of Alg. 3 with a number of baseline optimization methods: gradient descent (GD), non-linear conjugate gradients (NCG) (Hager and Zhang, 2006), Accelerated Gradient Descent (AGD) with adaptive restart (O'Donoghue and Candès, 2015) (RAGD), and a crippled version of Alg. 3 without negative curvature exploitation (C-Alg. 3). We compare the algorithms on the number of gradient steps, but note that the number of oracle queries per step varies between methods.
Researcher Affiliation | Academia | Stanford University, Stanford, California, USA. Correspondence to: Yair Carmon <yairc@stanford.edu>, Oliver Hinder <ohinder@stanford.edu>.
Pseudocode | Yes | Algorithm 1 AGD-UNTIL-GUILTY(f, y0, ε, L, σ) ... Algorithm 2 EXPLOIT-NC-PAIR(f, u, v, …) ... Algorithm 3 GUARDED-NON-CONVEX-AGD(f, p0, L1, …) ... Algorithm 4 EXPLOIT-NC-PAIR′(f, u, v, …) ... Algorithm 5 FIND-BEST-ITERATE′(f, yt)
Open Source Code | No | The paper discusses implementation details and comparisons with other methods, but does not provide a specific link or explicit statement about the release of source code for its described methodology.
Open Datasets | Yes | For the second experiment we fit a neural network model comprising three fully-connected hidden layers containing 20, 10 and 5 units, respectively, on the MNIST handwritten digits dataset (LeCun et al., 1998) (see Section E.3).
Dataset Splits | No | The paper mentions using the MNIST dataset, which has well-known standard splits, but it does not explicitly state the specific training, validation, or test split percentages or sample counts used in their experiments.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper states 'We implement our methods in Python and use PyTorch for automatic differentiation' in Section E.1, but does not provide specific version numbers for Python, PyTorch, or any other software dependencies.
Experiment Setup | Yes | For our experiments, we use a constant step size of 0.1 for gradient descent, and set the remaining parameters of Algorithm 3 to 10^-3, 10^-5, and 10^-2. For RAGD, we use the standard parameters from O'Donoghue and Candès (2015). ... We use a batch size of 256 for all methods and train for 500 epochs. Initial weights are randomized from a normal distribution with standard deviation 0.1, and biases are initialized to zero.
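
The "Research Type" and "Pseudocode" rows above quote the paper's negative curvature detection-and-exploitation scheme (Algorithms 1-2). The snippet below is a minimal NumPy sketch of that idea, not the authors' code: the function names, the sigma and alpha arguments, and the exact form of the certificate are illustrative assumptions based on the quoted description.

```python
import numpy as np

def violates_strong_convexity(f, grad_f, u, v, sigma):
    """Certificate check (sketch): a sigma-strongly-convex f must satisfy
    f(u) >= f(v) + <grad f(v), u - v> + (sigma/2) * ||u - v||^2.
    Returning True means the pair (u, v) witnesses non-convexity
    ("proven guilty")."""
    lower_bound = f(v) + grad_f(v) @ (u - v) + 0.5 * sigma * np.linalg.norm(u - v) ** 2
    return f(u) < lower_bound

def exploit_nc_pair(f, u, v, alpha):
    """Negative-curvature exploitation (sketch): step a distance alpha from u
    along +/- the unit direction (u - v) and keep whichever point has the
    lowest function value."""
    d = (u - v) / np.linalg.norm(u - v)
    candidates = [u, u + alpha * d, u - alpha * d]
    return min(candidates, key=f)
```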
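Building on the helpers above, here is a rough sketch of the guarded outer loop described for Algorithm 3: take accelerated (momentum) steps while the convexity certificate holds, and fall back to negative-curvature exploitation when it fails. The momentum schedule, restart rule, and stopping criterion below are simplified placeholders, not the paper's exact procedure.

```python
def guarded_agd(f, grad_f, p0, L1, sigma, alpha, eps, max_iters=1000):
    """Sketch of a guarded AGD loop: Nesterov-style steps with step size 1/L1,
    monitored by the strong-convexity certificate between consecutive iterates."""
    x = y = p0
    theta = 1.0
    for _ in range(max_iters):
        if np.linalg.norm(grad_f(x)) <= eps:
            return x                                  # approximate stationary point
        x_next = y - grad_f(y) / L1                   # gradient step at the extrapolated point
        theta_next = 0.5 * (1 + np.sqrt(1 + 4 * theta ** 2))
        y = x_next + ((theta - 1) / theta_next) * (x_next - x)   # Nesterov momentum
        if violates_strong_convexity(f, grad_f, x_next, x, sigma):
            x = y = exploit_nc_pair(f, x_next, x, alpha)         # exploit, then restart momentum
            theta = 1.0
        else:
            x, theta = x_next, theta_next
    return x
```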
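The "Open Datasets" and "Experiment Setup" rows describe the neural-network experiment: an MLP with hidden layers of 20, 10, and 5 units on MNIST, batch size 256, 500 epochs, weights initialized from N(0, 0.1^2) with zero biases, and a gradient-descent baseline with step size 0.1. The following PyTorch sketch reproduces that setup under stated assumptions: the activation function, the cross-entropy loss, and the data-loading details are not quoted above and are chosen here for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Three fully-connected hidden layers with 20, 10 and 5 units; tanh is an assumed activation.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20), nn.Tanh(),
    nn.Linear(20, 10), nn.Tanh(),
    nn.Linear(10, 5), nn.Tanh(),
    nn.Linear(5, 10),
)

# Weights ~ N(0, 0.1^2), biases zero, matching the quoted initialization.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.1)
        nn.init.zeros_(m.bias)

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)  # batch size 256 as quoted

# Plain gradient-descent baseline with the quoted constant step size 0.1;
# cross-entropy loss is an assumption, not taken from the quoted text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(500):  # 500 epochs as quoted
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
```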