Geometry of Neural Network Loss Surfaces via Random Matrix Theory
Authors: Jeffrey Pennington, Yasaman Bahri
ICML 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our analysis predicts, and numerical simulations support, that for critical points of small index, the number of negative eigenvalues scales like the 3/2 power of the energy. We conduct large-scale experiments to examine the distribution of critical points and compare them with our theoretical predictions. |
| Researcher Affiliation | Industry | Jeffrey Pennington and Yasaman Bahri, Google Brain. Correspondence to: Jeffrey Pennington <jpennin@google.com>. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about the release of source code for the methodology, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Data is for a trained single-hidden-layer ReLU autoencoding network with 20 hidden units and no biases, fit to 150 4×4-downsampled, grayscaled, whitened CIFAR-10 images. |
| Dataset Splits | No | The paper mentions using random sampling for data but does not provide specific details on dataset splits (e.g., percentages or sample counts for training, validation, or testing). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, or detailed computer specifications used for running experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details, such as library or solver names with version numbers. |
| Experiment Setup | Yes | We train single-hidden-layer tanh networks of size n = 16, which also equals the input and output dimensionality. For each training run, the data and targets are randomly sampled from standard normal distributions, which makes this a kind of memorization task. [...] First we optimize the network with standard gradient descent until the loss reaches a random value between 0 and the initial loss. From that point on, we switch to minimizing a new objective, J_g = |∇_θ L|², which, unlike the primary objective, is attracted to saddle points. Gradient descent on J_g only requires the computation of Hessian-vector products and can be executed efficiently. We discard any run for which the final J_g > 10⁻⁶; otherwise we record the final energy and index. |
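
The experiment-setup row quoted above describes a two-phase procedure: ordinary gradient descent on the loss down to a random threshold, then gradient descent on the squared gradient norm J_g = |∇_θ L|² to land on a critical point whose energy and index are recorded. The sketch below is a rough illustration of that procedure, not the authors' code: it assumes PyTorch, and the number of training pairs, learning rates, and iteration counts are placeholders the paper does not specify.

```python
import torch

torch.manual_seed(0)

# n = 16 comes from the quoted setup; the number of training pairs (m),
# learning rates, and step counts are illustrative guesses.
n = 16
m = 16
X = torch.randn(m, n)   # random inputs  ~ N(0, 1)
Y = torch.randn(m, n)   # random targets ~ N(0, 1)  (memorization task)

# Single-hidden-layer tanh network with no biases.
W1 = (torch.randn(n, n) / n ** 0.5).requires_grad_()
W2 = (torch.randn(n, n) / n ** 0.5).requires_grad_()
params = [W1, W2]

def loss():
    return ((torch.tanh(X @ W1) @ W2 - Y) ** 2).mean()

def grad_norm_sq():
    # J_g = |grad_theta L|^2, built with create_graph=True so it can itself be
    # differentiated: its gradient, 2 * H * grad(L), is a Hessian-vector product.
    grads = torch.autograd.grad(loss(), params, create_graph=True)
    return sum((g ** 2).sum() for g in grads)

# Phase 1: plain gradient descent on L until the loss falls below a random
# threshold drawn uniformly between 0 and the initial loss.
threshold = torch.rand(1).item() * loss().item()
for _ in range(5000):
    if loss().item() <= threshold:
        break
    grads = torch.autograd.grad(loss(), params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= 0.05 * g

# Phase 2: gradient descent on J_g, which is attracted to critical points
# (including saddles) and only needs Hessian-vector products.
for _ in range(20000):
    hvp = torch.autograd.grad(grad_norm_sq(), params)
    with torch.no_grad():
        for p, g in zip(params, hvp):
            p -= 0.01 * g

final_jg = grad_norm_sq().item()
if final_jg <= 1e-6:  # otherwise the run is discarded
    # Record the energy and the index (fraction of negative Hessian eigenvalues).
    def loss_flat(theta):
        w1 = theta[: n * n].reshape(n, n)
        w2 = theta[n * n:].reshape(n, n)
        return ((torch.tanh(X @ w1) @ w2 - Y) ** 2).mean()

    theta = torch.cat([p.detach().reshape(-1) for p in params])
    hessian = torch.autograd.functional.hessian(loss_flat, theta)
    index = (torch.linalg.eigvalsh(hessian) < 0).float().mean().item()
    print(f"energy={loss().item():.4f}  index={index:.3f}  J_g={final_jg:.2e}")
else:
    print(f"discarded: J_g={final_jg:.2e} > 1e-6")
```

The point of minimizing J_g is that its gradient with respect to θ is a Hessian-vector product, which autograd evaluates by double backpropagation without ever forming the full Hessian; the full Hessian is only assembled at the end (here via `torch.autograd.functional.hessian`) to count negative eigenvalues for the index.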