Spurious Local Minima are Common in Two-Layer ReLU Neural Networks
Authors: Itay Safran, Ohad Shamir
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our proof technique is a bit unorthodox. Although it is possible to write down the gradient of Eq. (1) in closed form (without the expectation), it is not clear how to get analytical expressions for its roots, and hence characterize the stationary points of Eq. (1). As far as we know, an analytical expression for the roots might not even exist. Instead, we employed the following strategy: We ran standard gradient descent with random initialization on the objective function, until we reached a point which is suboptimal (function value being significantly higher than 0); approximately stationary (gradient norm very close to 0); and with a strictly positive definite Hessian (with minimal eigenvalue significantly larger than 0). We use a computer to verify these conditions in a formal manner, avoiding floating-point arithmetic and the possibility of rounding errors. Relying on these numbers, we employ a Taylor expansion argument to show that we must have arrived at a point very close to a local (non-global) minimum of Eq. (1), hence establishing the existence of such minima. [...] In this section, we provide a formal proof of Corollary 1, as well as an outline of the proof of Thm. 1. We also provide closed-form expressions for the objective and its derivatives. Missing parts of the proofs are provided in the appendix. (An illustrative, floating-point-only sketch of this candidate-checking step appears after the table.) |
| Researcher Affiliation | Academia | Itay Safran 1 Ohad Shamir 1 1Weizmann Institute of Science, Rehovot, Israel. Correspondence to: Itay Safran <itay.safran@weizmann.ac.il>, Ohad Shamir <ohad.shamir@weizmann.ac.il>. |
| Pseudocode | No | The paper describes the gradient descent procedure but does not present it as a formal pseudocode block or algorithm. |
| Open Source Code | Yes | The code we used is freely available at https://github.com/ItaySafran/OneLayerGDconvergence.git. |
| Open Datasets | No | The paper uses a synthetic data generation process based on a standard Gaussian distribution and a network of similar architecture for target values, not a publicly available named dataset. "We consider directly optimizing the expected squared loss, where the input is standard Gaussian, and in the realizable case, namely that the target values are generated by a network of a similar architecture" |
| Dataset Splits | No | The paper does not mention training, validation, or test dataset splits in the traditional machine learning sense. The problem is formulated with an expectation over a Gaussian distribution, and experiments are instantiations of gradient descent runs on this objective function. |
| Hardware Specification | Yes | For any candidate local minimum, the verification took from less than a minute up to a few hours, depending on the size of k, n, when running on Intel Xeon E5 processors (ranging from E5-2430 to E5-2660). |
| Software Dependencies | Yes | We used MATLAB (version 2017b) to perform all floating-point computations, and its associated MATLAB VPA package to perform the exact symbolic computations. |
| Experiment Setup | Yes | For each value of (k, n), where k ∈ [20] and n ∈ {k, . . . , 20}, we ran 1000 instantiations of gradient descent on the objective in Eq. (2), each starting from a different random initialization. Each instantiation was run with a fixed step size of 0.1, until reaching a candidate stationary point / local minimum (the stopping criterion was that the gradient norm w.r.t. any w_i is at most 10^−9). Points obtaining objective values less than 10^−3 were ignored, as those are likely to be close to a global minimum. [...] We used standard Xavier initialization: Each weight vector w_i was sampled i.i.d. from a Gaussian distribution in R^k, with zero mean and covariance (1/k)·I. (A minimal re-creation of this setup is sketched below the table.) |
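The authors' own MATLAB code is available at the GitHub link above. The NumPy sketch below is only an illustrative re-creation of the setup quoted in the Experiment Setup row, not their implementation. It relies on the closed-form identity E_x[ReLU(⟨u,x⟩) ReLU(⟨v,x⟩)] = (1/2π)(‖u‖‖v‖ sin θ + (π − θ)⟨u,v⟩) for standard Gaussian x (θ the angle between u and v), and it assumes the target weights are the first n standard basis vectors as one instantiation of orthonormal targets; the input dimension `d`, this target choice, and the exact initialization scaling are assumptions of the sketch rather than details confirmed by the excerpt.

```python
import numpy as np

def g(u, v):
    # Closed-form E_x[ReLU(<u,x>) ReLU(<v,x>)] for x ~ N(0, I):
    # (1/2pi) * (|u| |v| sin(t) + (pi - t) <u,v>), t = angle between u and v.
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    if nu == 0.0 or nv == 0.0:
        return 0.0
    t = np.arccos(np.clip(u @ v / (nu * nv), -1.0, 1.0))
    return (nu * nv * np.sin(t) + (np.pi - t) * (u @ v)) / (2.0 * np.pi)

def grad_g(u, v):
    # Gradient of g(u, v) in its first argument (u assumed nonzero):
    # (1/2pi) * (|v| sin(t) * u/|u| + (pi - t) * v).
    nu, nv = np.linalg.norm(u), np.linalg.norm(v)
    t = np.arccos(np.clip(u @ v / (nu * nv), -1.0, 1.0))
    return (nv * np.sin(t) * u / nu + (np.pi - t) * v) / (2.0 * np.pi)

def objective(W, V):
    # Expected squared loss E_x[(sum_i ReLU(<w_i,x>) - sum_j ReLU(<v_j,x>))^2],
    # expanded into pairwise kernel terms over the rows of W (learned) and V (target).
    return (sum(g(a, b) for a in W for b in W)
            - 2.0 * sum(g(a, b) for a in W for b in V)
            + sum(g(a, b) for a in V for b in V))

def gradient(W, V):
    # Closed-form gradient of the objective w.r.t. each row of W.
    # The factor 2 accounts for w_i appearing in both slots of the symmetric kernel.
    return np.stack([2.0 * (sum(grad_g(w, u) for u in W) - sum(grad_g(w, v) for v in V))
                     for w in W])

def run_gd(k, n, d=None, step=0.1, grad_tol=1e-9, max_iters=10**6, seed=0):
    # One instantiation of the quoted setup: fixed step size 0.1, stop once every
    # per-neuron gradient norm is at most 1e-9 (or the iteration cap is hit).
    d = d if d is not None else max(k, n)               # assumed input dimension
    rng = np.random.default_rng(seed)
    V = np.eye(d)[:n]                                   # assumed orthonormal targets
    W = rng.normal(scale=np.sqrt(1.0 / k), size=(k, d)) # Xavier-style init, covariance (1/k) I
    for _ in range(max_iters):
        G = gradient(W, V)
        if np.max(np.linalg.norm(G, axis=1)) <= grad_tol:
            break
        W -= step * G
    return W, objective(W, V)

if __name__ == "__main__":
    W, val = run_gd(k=6, n=6, seed=1)
    # Per the quoted setup, runs ending with objective below 1e-3 are discarded as
    # likely (near-)global minima; larger values flag spurious-minimum candidates.
    print(f"final objective: {val:.6g}")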
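The Research Type row quotes the candidate-checking step: a run is kept only if it ends suboptimal, approximately stationary, and with a strictly positive definite Hessian, and the paper then certifies these conditions in exact symbolic arithmetic (MATLAB VPA) before applying a Taylor-expansion argument. The snippet below only illustrates the three numerical checks in floating point, using a finite-difference Hessian of the closed-form gradient above; it is not the formal, rounding-error-free verification the authors perform. The value and gradient thresholds are those quoted in the table, while the eigenvalue check here simply requires strict positivity.

```python
def numerical_hessian(W, V, eps=1e-6):
    # Central finite differences of the closed-form gradient, with W flattened
    # into a single vector. Purely a floating-point illustration: the paper's
    # verification uses exact symbolic computation instead.
    k, d = W.shape
    m = k * d
    H = np.zeros((m, m))
    flat = W.ravel()
    for j in range(m):
        step = np.zeros(m)
        step[j] = eps
        gp = gradient((flat + step).reshape(k, d), V).ravel()
        gm = gradient((flat - step).reshape(k, d), V).ravel()
        H[:, j] = (gp - gm) / (2.0 * eps)
    return 0.5 * (H + H.T)  # symmetrize

def spurious_minimum_candidate(W, V, value_tol=1e-3, grad_tol=1e-9):
    # The three conditions from the quoted proof strategy: suboptimal value,
    # near-zero gradient, strictly positive definite Hessian.
    value = objective(W, V)
    grad_norm = np.max(np.linalg.norm(gradient(W, V), axis=1))
    min_eig = np.linalg.eigvalsh(numerical_hessian(W, V)).min()
    return value > value_tol and grad_norm <= grad_tol and min_eig > 0.0
```

In the paper, a candidate passing (the exact-arithmetic analogue of) these checks is then fed into the Taylor-expansion argument to conclude that a genuine spurious local minimum of Eq. (1) lies nearby.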