Escaping Saddles with Stochastic Gradients

Authors: Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, Thomas Hofmann

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we provide experimental evidence suggesting the validity of this condition for training neural networks. In particular we show that, while the variance of uniform noise along eigenvectors corresponding to the most negative eigenvalue decreases as O(1/d), stochastic gradients have a significant component along this direction independent of the width and depth of the neural net. When looking at the entire eigenspectrum, we find that this variance increases with the magnitude of the associated eigenvalues. Hereby, we contribute to a better understanding of the success of training deep networks with SGD and its extensions.
Researcher Affiliation | Academia | Hadi Daneshmand *1, Jonas Kohler *1, Aurelien Lucchi 1, Thomas Hofmann 1. 1 ETH Zurich, Switzerland. Correspondence to: Hadi Daneshmand <hadi.daneshmand@inf.ethz.ch>.
Pseudocode | Yes | Algorithm 1 CNC-PGD; Algorithm 2 CNC-SGD
Open Source Code | No | The paper does not provide any specific links or explicit statements about the availability of its source code.
Open Datasets | Yes | All of these experiments are conducted using feed-forward networks on the well-known MNIST classification task (n = 70 000).
Dataset Splits | No | The paper mentions using the MNIST dataset but does not explicitly specify the training, validation, or test splits (e.g., percentages or exact counts) or reference a standard split with a citation for reproducibility.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU specifications) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experiments.
Experiment Setup | Yes | The parameters we use for the half-space problem are as follows: learning rate η = 0.05 for SGD, η = 0.005 for GD, r = 0.1 for perturbed methods. For the neural network experiments, we use a constant learning rate of 0.01 for SGD.
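The Research Type row quotes the paper's central empirical claim: isotropic (uniform) noise has only an O(1/d) squared component along any fixed direction, such as the eigenvector of the most negative Hessian eigenvalue, whereas stochastic gradients do not suffer this dimension dependence. A minimal numpy sketch of the O(1/d) scaling for isotropic perturbations; the dimensions, sample count, and the choice of the reference direction as a coordinate axis are arbitrary illustrations, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

for d in (10, 100, 1000, 5000):
    # Fixed unit direction standing in for the eigenvector of the most
    # negative Hessian eigenvalue (arbitrary choice for illustration).
    v = np.zeros(d)
    v[0] = 1.0

    # Isotropic unit-norm perturbations, as used by classical perturbed GD.
    xi = rng.standard_normal((2000, d))
    xi /= np.linalg.norm(xi, axis=1, keepdims=True)

    # The mean squared component along v concentrates around 1/d.
    proj2 = (xi @ v) ** 2
    print(f"d={d:5d}  E[(xi^T v)^2] ~ {proj2.mean():.5f}   1/d = {1/d:.5f}")
```

The printout shows the squared projection shrinking in lockstep with 1/d, which is the dimension-dependent price paid by isotropic perturbations that the paper's CNC condition is meant to avoid.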
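The Pseudocode row lists Algorithm 1 (CNC-PGD) and Algorithm 2 (CNC-SGD). The paper's idea is to replace the isotropic perturbation of classical perturbed gradient descent with a single stochastic gradient step, relying on the CNC condition to supply negative-curvature signal. Below is a hedged sketch of that structure only; the function names, gradient-norm threshold, and iteration budget are illustrative assumptions, not the authors' exact Algorithm 1. The default η and r echo the half-space values quoted in the Experiment Setup row.

```python
import numpy as np

def cnc_pgd_sketch(full_grad, stoch_grad, x0, eta=0.005, r=0.1,
                   grad_tol=1e-3, T=10_000):
    """Sketch of a CNC-PGD-style loop (illustrative, not the paper's exact Algorithm 1).

    full_grad(x)  -> full-batch gradient at x
    stoch_grad(x) -> gradient of a single random sample / mini-batch at x
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(T):
        g = full_grad(x)
        if np.linalg.norm(g) > grad_tol:
            # Ordinary gradient descent while the gradient is informative.
            x = x - eta * g
        else:
            # Near a first-order stationary point: perturb with ONE stochastic
            # gradient step (step size r) instead of adding isotropic noise.
            x = x - r * stoch_grad(x)
    return x
```

The design point is that the second branch injects negative-curvature signal without any explicit Hessian or eigenvector computation; whether the threshold test uses the full gradient or a mini-batch estimate is left open in this sketch.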
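The Experiment Setup row reports only learning rates. Collected below as a configuration sketch of the quoted values; batch sizes, epoch counts, and network sizes are not stated in the extract and are deliberately omitted rather than guessed.

```python
# Hyperparameters quoted from the paper, grouped by experiment.
HALF_SPACE = {
    "sgd_learning_rate": 0.05,   # eta for SGD
    "gd_learning_rate": 0.005,   # eta for GD
    "perturbation_radius": 0.1,  # r for perturbed methods
}

NEURAL_NET_MNIST = {
    "sgd_learning_rate": 0.01,   # constant learning rate for SGD
}
```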