“Convex Until Proven Guilty”: Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

Authors: Yair Carmon, John C. Duchi, Oliver Hinder, Aaron Sidford

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5. Preliminary experiments: The primary purpose of this paper is to demonstrate the feasibility of acceleration for non-convex problems using only first-order information. Given the long history of development of careful schemes for non-linear optimization, it is unrealistic to expect a simple implementation of the momentum-based Algorithm 3 to outperform state-of-the-art methods such as non-linear conjugate gradients and L-BFGS. It is important, however, to understand the degree of non-convexity in problems we encounter in practice, and to investigate the efficacy of the negative curvature detection-and-exploitation scheme we propose. Toward this end, we present two experiments: (1) fitting a non-linear regression model and (2) training a small neural network. In these experiments we compare a basic implementation of Alg. 3 with a number of baseline optimization methods: gradient descent (GD), non-linear conjugate gradients (NCG) (Hager and Zhang, 2006), Accelerated Gradient Descent (AGD) with adaptive restart (O'Donoghue and Candès, 2015) (RAGD), and a crippled version of Alg. 3 without negative curvature exploitation (C-Alg. 3). We compare the algorithms on the number of gradient steps, but note that the number of oracle queries per step varies between methods.
Researcher Affiliation | Academia | Stanford University, Stanford, California, USA. Correspondence to: Yair Carmon <yairc@stanford.edu>, Oliver Hinder <ohinder@stanford.edu>.
Pseudocode | Yes | Algorithm 1 AGD-UNTIL-GUILTY(f, y0, ε, L, σ) ... Algorithm 2 EXPLOIT-NC-PAIR(f, u, v, …) ... Algorithm 3 GUARDED-NON-CONVEX-AGD(f, p0, L1, …) ... Algorithm 4 EXPLOIT-NC-PAIR′(f, u, v, …) ... Algorithm 5 FIND-BEST-ITERATE′(f, yt)
Open Source Code | No | The paper discusses implementation details and comparisons with other methods, but does not provide a specific link or explicit statement about the release of source code for its described methodology.
Open Datasets | Yes | For the second experiment we fit a neural network model comprising three fully-connected hidden layers containing 20, 10 and 5 units, respectively, on the MNIST handwritten digits dataset (LeCun et al., 1998) (see Section E.3).
Dataset Splits | No | The paper mentions using the MNIST dataset, which has well-known standard splits, but it does not explicitly state the specific training, validation, or test split percentages or sample counts used in their experiments.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper states 'We implement our methods in Python and use PyTorch for automatic differentiation' in Section E.1, but does not provide specific version numbers for Python, PyTorch, or any other software dependencies.
Experiment Setup | Yes | For our experiments, we use a constant step size of 0.1 for gradient descent, and set the remaining parameters of Algorithm 3 to 10^-3, 10^-5, and 10^-2. For RAGD, we use the standard parameters from O'Donoghue and Candès (2015). ... We use a batch size of 256 for all methods and train for 500 epochs. Initial weights are randomized from a normal distribution with standard deviation 0.1, and biases are initialized to zero.
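
The "Research Type" and "Pseudocode" rows above quote the paper's negative curvature detection-and-exploitation scheme (Algorithms 1-2). The snippet below is a minimal NumPy sketch of that idea, not the authors' code: the function names, the sigma and alpha arguments, and the exact form of the certificate are illustrative assumptions based on the quoted description.

```python
import numpy as np

def violates_strong_convexity(f, grad_f, u, v, sigma):
    """Certificate check (sketch): a sigma-strongly-convex f must satisfy
    f(u) >= f(v) + <grad f(v), u - v> + (sigma/2) * ||u - v||^2.
    Returning True means the pair (u, v) witnesses non-convexity
    ("proven guilty")."""
    lower_bound = f(v) + grad_f(v) @ (u - v) + 0.5 * sigma * np.linalg.norm(u - v) ** 2
    return f(u) < lower_bound

def exploit_nc_pair(f, u, v, alpha):
    """Negative-curvature exploitation (sketch): step a distance alpha from u
    along +/- the unit direction (u - v) and keep whichever point has the
    lowest function value."""
    d = (u - v) / np.linalg.norm(u - v)
    candidates = [u, u + alpha * d, u - alpha * d]
    return min(candidates, key=f)
```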
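Building on the helpers above, here is a rough sketch of the guarded outer loop described for Algorithm 3: take accelerated (momentum) steps while the convexity certificate holds, and fall back to negative-curvature exploitation when it fails. The momentum schedule, restart rule, and stopping criterion below are simplified placeholders, not the paper's exact procedure.

```python
def guarded_agd(f, grad_f, p0, L1, sigma, alpha, eps, max_iters=1000):
    """Sketch of a guarded AGD loop: Nesterov-style steps with step size 1/L1,
    monitored by the strong-convexity certificate between consecutive iterates."""
    x = y = p0
    theta = 1.0
    for _ in range(max_iters):
        if np.linalg.norm(grad_f(x)) <= eps:
            return x                                  # approximate stationary point
        x_next = y - grad_f(y) / L1                   # gradient step at the extrapolated point
        theta_next = 0.5 * (1 + np.sqrt(1 + 4 * theta ** 2))
        y = x_next + ((theta - 1) / theta_next) * (x_next - x)   # Nesterov momentum
        if violates_strong_convexity(f, grad_f, x_next, x, sigma):
            x = y = exploit_nc_pair(f, x_next, x, alpha)         # exploit, then restart momentum
            theta = 1.0
        else:
            x, theta = x_next, theta_next
    return x
```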
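The "Open Datasets" and "Experiment Setup" rows describe the neural-network experiment: an MLP with hidden layers of 20, 10, and 5 units on MNIST, batch size 256, 500 epochs, weights initialized from N(0, 0.1^2) with zero biases, and a gradient-descent baseline with step size 0.1. The following PyTorch sketch reproduces that setup under stated assumptions: the activation function, the cross-entropy loss, and the data-loading details are not quoted above and are chosen here for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Three fully-connected hidden layers with 20, 10 and 5 units; tanh is an assumed activation.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 20), nn.Tanh(),
    nn.Linear(20, 10), nn.Tanh(),
    nn.Linear(10, 5), nn.Tanh(),
    nn.Linear(5, 10),
)

# Weights ~ N(0, 0.1^2), biases zero, matching the quoted initialization.
for m in model.modules():
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=0.1)
        nn.init.zeros_(m.bias)

train_set = datasets.MNIST("data", train=True, download=True,
                           transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)  # batch size 256 as quoted

# Plain gradient-descent baseline with the quoted constant step size 0.1;
# cross-entropy loss is an assumption, not taken from the quoted text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(500):  # 500 epochs as quoted
    for images, labels in loader:
        optimizer.zero_grad()
        loss_fn(model(images), labels).backward()
        optimizer.step()
```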