Spurious Local Minima are Common in Two-Layer ReLU Neural Networks
Authors: Itay Safran, Ohad Shamir
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proof technique is a bit unorthodox. Although it is possible to write down the gradient of Eq. (1) in closed form (without the expectation), it is not clear how to get analytical expressions for its roots, and hence characterize the stationary points of Eq. (1). As far as we know, an analytical expression for the roots might not even exist. Instead, we employed the following strategy: We ran standard gradient descent with random initialization on the objective function, until we reached a point which is suboptimal (function value being significantly higher than 0); approximately stationary (gradient norm very close to 0); and with a strictly positive definite Hessian (with minimal eigenvalue significantly larger than 0). We use a computer to verify these conditions in a formal manner, avoiding floating-point arithmetic and the possibility of rounding errors. Relying on these numbers, we employ a Taylor expansion argument to show that we must have arrived at a point very close to a local (non-global) minimum of Eq. (1), hence establishing the existence of such minima. [...] In this section, we provide a formal proof of Corollary 1, as well as an outline of the proof of Thm. 1. We also provide closed-form expressions for the objective and its derivatives. Missing parts of the proofs are provided in the appendix. (An illustrative, floating-point-only sketch of this candidate-checking step appears after the table.) |
| Researcher Affiliation | Academia | Itay Safran 1 Ohad Shamir 1 1Weizmann Institute of Science, Rehovot, Israel. Correspondence to: Itay Safran <itay.safran@weizmann.ac.il>, Ohad Shamir <ohad.shamir@weizmann.ac.il>. |
| Pseudocode | No | The paper describes the gradient descent procedure but does not present it as a formal pseudocode block or algorithm. |
| Open Source Code | Yes | The code we used is freely available at https://github.com/ItaySafran/OneLayerGDconvergence.git. |
| Open Datasets | No | The paper uses a synthetic data generation process based on a standard Gaussian distribution and a network of similar architecture for target values, not a publicly available named dataset. "We consider directly optimizing the expected squared loss, where the input is standard Gaussian, and in the realizable case, namely that the target values are generated by a network of a similar architecture" |
| Dataset Splits | No | The paper does not mention training, validation, or test dataset splits in the traditional machine learning sense. The problem is formulated with an expectation over a Gaussian distribution, and experiments are instantiations of gradient descent runs on this objective function. |
| Hardware Specification | Yes | For any candidate local minimum, the verification took from less than a minute up to a few hours, depending on the size of k, n, when running on Intel Xeon E5 processors (ranging from E5-2430 to E5-2660). |
| Software Dependencies | Yes | We used MATLAB (version 2017b) to perform all floating-point computations, and its associated MATLAB VPA package to perform the exact symbolic computations. |
| Experiment Setup | Yes | For each value of (k, n), where k ∈ [20] and n ∈ {k, . . . , 20}, we ran 1000 instantiations of gradient descent on the objective in Eq. (2), each starting from a different random initialization. Each instantiation was run with a fixed step size of 0.1, until reaching a candidate stationary point / local minimum (the stopping criterion was that the gradient norm w.r.t. any w_i is at most 10^−9). Points obtaining objective values less than 10^−3 were ignored, as those are likely to be close to a global minimum. [...] We used standard Xavier initialization: Each weight vector w_i was sampled i.i.d. from a Gaussian distribution in R^k, with zero mean and covariance (1/k)·I. (A minimal re-creation of this setup is sketched below the table.) |
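The authors' own MATLAB code is available at the GitHub link above. The NumPy sketch below is only an illustrative re-creation of the setup quoted in the Experiment Setup row, not their implementation. It relies on the closed-form identity E_x[ReLU(⟨u,x⟩) ReLU(⟨v,x⟩)] = (1/2π)(‖u‖‖v‖ sin θ + (π − θ)⟨u,v⟩) for standard Gaussian x (θ the angle between u and v), and it assumes the target weights are the first n standard basis vectors as one instantiation of orthonormal targets; the input dimension `d`, this target choice, and the exact initialization scaling are assumptions of the sketch rather than details confirmed by the excerpt.

```python
import numpy as np

def g(u, v):
    # Closed-form E_x[ReLU(<u,x>) ReLU(<v,x>)] for x ~ N(0, I):
    # (1/2pi) * (|u| |v| sin(t) + (pi - t) <u,v>), t = angle between u and v.
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    t = np.arccos(np.clip(u @ v / (nu * nv), -1.0, 1.0))
    return (nu * nv * np.sin(t) + (np.pi - t) * (u @ v)) / (2.0 * np.pi)

def grad_g(u, v):
    # Gradient of g(u, v) in its first argument (u assumed nonzero):
    # (1/2pi) * (|v| sin(t) * u/|u| + (pi - t) * v).
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    t = np.arccos(np.clip(u @ v / (nu * nv), -1.0, 1.0))
    return (nv * np.sin(t) * u / nu + (np.pi - t) * v) / (2.0 * np.pi)

def objective(W, V):
    # Expected squared loss E_x[(sum_i ReLU(<w_i,x>) - sum_j ReLU(<v_j,x>))^2],
    # expanded into pairwise kernel terms over the rows of W (learned) and V (target).
    return (sum(g(a, b) for a in W for b in W)
            - 2.0 * sum(g(a, b) for a in W for b in V)
            + sum(g(a, b) for a in V for b in V))

def gradient(W, V):
    # Closed-form gradient of the objective w.r.t. each row of W.
    # The factor 2 accounts for w_i appearing in both slots of the symmetric kernel.
    return np.stack([2.0 * (sum(grad_g(w, u) for u in W) - sum(grad_g(w, v) for v in V))
                     for w in W])

def run_gd(k, n, d=None, step=0.1, grad_tol=1e-9, max_iters=10**6, seed=0):
    # One instantiation of the quoted setup: fixed step size 0.1, stop once every
    # per-neuron gradient norm is at most 1e-9 (or the iteration cap is hit).
    d = d if d is not None else max(k, n)               # assumed input dimension
    rng = np.random.default_rng(seed)
    V = np.eye(d)[:n]                                   # assumed orthonormal targets
    W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d)) # Xavier-style init, covariance (1/k) I
    for _ in range(max_iters):
        G = gradient(W, V)
        if np.max(np.linalg.norm(G, axis=1)) <= grad_tol:
            break
        W -= step * G
    return W, objective(W, V)

if __name__ == "__main__":
    W, val = run_gd(k=6, n=6, seed=1)
    # Per the quoted setup, runs ending with objective below 1e-3 are discarded as
    # likely (near-)global minima; larger values flag spurious-minimum candidates.
    print(f"final objective: {val:.6g}")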
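The Research Type row quotes the candidate-checking step: a run is kept only if it ends suboptimal, approximately stationary, and with a strictly positive definite Hessian, and the paper then certifies these conditions in exact symbolic arithmetic (MATLAB VPA) before applying a Taylor-expansion argument. The snippet below only illustrates the three numerical checks in floating point, using a finite-difference Hessian of the closed-form gradient above; it is not the formal, rounding-error-free verification the authors perform. The value and gradient thresholds are those quoted in the table, while the eigenvalue check here simply requires strict positivity.

```python
def numerical_hessian(W, V, eps=1e-6):
    # Central finite differences of the closed-form gradient, with W flattened
    # into a single vector. Purely a floating-point illustration: the paper's
    # verification uses exact symbolic computation instead.
    k, d = W.shape
    m = k * d
    H = np.zeros((m, m))
    flat = W.ravel()
    for j in range(m):
        step = np.zeros(m)
        step[j] = eps
        gp = gradient((flat + step).reshape(k, d), V).ravel()
        gm = gradient((flat - step).reshape(k, d), V).ravel()
        H[:, j] = (gp - gm) / (2.0 * eps)
    return 0.5 * (H + H.T)  # symmetrize

def spurious_minimum_candidate(W, V, value_tol=1e-3, grad_tol=1e-9):
    # The three conditions from the quoted proof strategy: suboptimal value,
    # near-zero gradient, strictly positive definite Hessian.
    value = objective(W, V)
    grad_norm = np.max(np.linalg.norm(gradient(W, V), axis=1))
    min_eig = np.linalg.eigvalsh(numerical_hessian(W, V)).min()
    return value > value_tol and grad_norm <= grad_tol and min_eig > 0.0
```

In the paper, a candidate passing (the exact-arithmetic analogue of) these checks is then fed into the Taylor-expansion argument to conclude that a genuine spurious local minimum of Eq. (1) lies nearby.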