Geometry of Neural Network Loss Surfaces via Random Matrix Theory

Authors: Jeffrey Pennington, Yasaman Bahri

ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our analysis predicts and numerical simulations support that for critical points of small index, the number of negative eigenvalues scales like the 3/2 power of the energy. We conduct large-scale experiments to examine the distribution of critical points and compare with our theoretical predictions. (A schematic form of this scaling is given below the table.)
Researcher Affiliation | Industry | Jeffrey Pennington and Yasaman Bahri, Google Brain. Correspondence to: Jeffrey Pennington <jpennin@google.com>.
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements about the release of source code for the methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | Data is for a trained single-hidden-layer ReLU autoencoding network with 20 hidden units and no biases, trained on 150 4x4 downsampled, grayscaled, whitened CIFAR-10 images. (A sketch of this preprocessing pipeline follows the table.)
Dataset Splits | No | The paper mentions using random sampling for data but does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, or testing).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or detailed computer specifications used for running experiments.
Software Dependencies | No | The paper does not provide specific software dependency details, such as library or solver names with version numbers.
Experiment Setup | Yes | We train single-hidden-layer tanh networks of size n = 16, which also equals the input and output dimensionality. For each training run, the data and targets are randomly sampled from standard normal distributions, which makes this a kind of memorization task. [...] First we optimize the network with standard gradient descent until the loss reaches a random value between 0 and the initial loss. From that point on, we switch to minimizing a new objective, J_g = |∇_θ L|², which, unlike the primary objective, is attracted to saddle points. Gradient descent on J_g only requires the computation of Hessian-vector products and can be executed efficiently. We discard any run for which the final J_g > 10^-6; otherwise we record the final energy and index. (A sketch of this two-phase procedure follows the table.)
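
The 3/2-power scaling quoted in the Research Type row can be written schematically as below, where α denotes the index (the number of negative Hessian eigenvalues, suitably normalized) and ε the energy; the critical energy ε_c and the prefactor are refinements from the paper's analysis, not part of the quote, and are left unspecified here.

```latex
% Schematic form of the quoted scaling claim; \epsilon_c and the prefactor
% are only indicated, not specified, in this sketch.
\alpha(\epsilon) \;\propto\; \left(\epsilon - \epsilon_c\right)^{3/2},
\qquad \epsilon \gtrsim \epsilon_c
```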
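
As a concrete reading of the Open Datasets row, the sketch below prepares 150 4x4 downsampled, grayscaled, whitened CIFAR-10 inputs. The loading format (an (N, 32, 32, 3) uint8 array), the channel-average grayscale, the block-average downsampling, and the ZCA-style whitening are illustrative assumptions; the paper only states that the images were downsampled, grayscaled, and whitened.

```python
import numpy as np

def prepare_cifar_inputs(images, n_samples=150, eps=1e-5, seed=0):
    """images: uint8 array of shape (N, 32, 32, 3) holding CIFAR-10 pixels."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(images), size=n_samples, replace=False)
    x = images[idx].astype(np.float64) / 255.0

    # Grayscale by averaging the RGB channels (one simple choice).
    x = x.mean(axis=-1)                                      # (150, 32, 32)

    # Downsample 32x32 -> 4x4 by averaging non-overlapping 8x8 blocks.
    x = x.reshape(n_samples, 4, 8, 4, 8).mean(axis=(2, 4))   # (150, 4, 4)

    # Whiten the flattened 16-dimensional inputs (ZCA-style).
    x = x.reshape(n_samples, 16)
    x -= x.mean(axis=0)
    cov = x.T @ x / n_samples
    evals, evecs = np.linalg.eigh(cov)
    zca = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return x @ zca                                           # (150, 16) whitened
```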
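
The Experiment Setup row describes a two-phase procedure: ordinary gradient descent on the loss down to a random energy, followed by gradient descent on J_g = |∇_θ L|² to land on a nearby critical point, keeping only runs with final J_g < 10^-6 and recording their energy and index. A minimal sketch using JAX autodiff follows; the architecture and the two objectives track the quoted description, while the number of training examples, learning rates, iteration caps, and initialization scale are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

n = 16        # hidden width = input dim = output dim, as in the quoted setup
m = 32        # number of random examples (assumption; not stated in the quote)

def loss(params, x, y):
    """Mean squared error of a single-hidden-layer tanh network."""
    W1, W2 = params
    return 0.5 * jnp.mean(jnp.sum((W2 @ jnp.tanh(W1 @ x) - y) ** 2, axis=0))

def saddle_objective(params, x, y):
    """J_g = |grad_theta L|^2; its gradient needs only Hessian-vector products."""
    g = jax.grad(loss)(params, x, y)
    return sum(jnp.sum(gi ** 2) for gi in jax.tree_util.tree_leaves(g))

kW1, kW2, kx, ky, kt = jax.random.split(jax.random.PRNGKey(0), 5)
x = jax.random.normal(kx, (n, m))            # random inputs (memorization task)
y = jax.random.normal(ky, (n, m))            # random targets
params = (jax.random.normal(kW1, (n, n)) / jnp.sqrt(n),
          jax.random.normal(kW2, (n, n)) / jnp.sqrt(n))

# Phase 1: gradient descent on the loss until it falls below a random fraction
# of the initial loss.
target = float(jax.random.uniform(kt)) * float(loss(params, x, y))
grad_L = jax.jit(jax.grad(loss))
for _ in range(50_000):
    if float(loss(params, x, y)) <= target:
        break
    g = grad_L(params, x, y)
    params = tuple(p - 0.05 * gi for p, gi in zip(params, g))

# Phase 2: gradient descent on J_g, which is attracted to critical points of
# any index, not just minima.
grad_Jg = jax.jit(jax.grad(saddle_objective))
for _ in range(50_000):
    g = grad_Jg(params, x, y)
    params = tuple(p - 0.01 * gi for p, gi in zip(params, g))

# Keep the run only if a critical point was found; record its energy and index.
if float(saddle_objective(params, x, y)) < 1e-6:
    flat = jnp.concatenate([p.ravel() for p in params])
    unflatten = lambda v: (v[:n * n].reshape(n, n), v[n * n:].reshape(n, n))
    eigs = jnp.linalg.eigvalsh(jax.hessian(lambda v: loss(unflatten(v), x, y))(flat))
    energy = float(loss(params, x, y))
    index = float(jnp.mean(eigs < 0))        # fraction of negative eigenvalues
    print(f"energy = {energy:.4g}, normalized index = {index:.4g}")
```

In this sketch the gradient of J_g is obtained directly by second-order autodiff, which reduces to a Hessian-vector product with the loss gradient; the full Hessian is only built at the very end (512 parameters here) to count negative eigenvalues.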