Escaping Saddles with Stochastic Gradients
Authors: Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, Thomas Hofmann
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we provide experimental evidence suggesting the validity of this condition for training neural networks. In particular we show that, while the variance of uniform noise along eigenvectors corresponding to the most negative eigenvalue decreases as O(1/d), stochastic gradients have a significant component along this direction independent of the width and depth of the neural net. When looking at the entire eigenspectrum, we find that this variance increases with the magnitude of the associated eigenvalues. Hereby, we contribute to a better understanding of the success of training deep networks with SGD and its extensions. |
| Researcher Affiliation | Academia | Hadi Daneshmand*, Jonas Kohler*, Aurelien Lucchi, Thomas Hofmann. ETH Zurich, Switzerland. Correspondence to: Hadi Daneshmand <hadi.daneshmand@inf.ethz.ch>. |
| Pseudocode | Yes | Algorithm 1 CNC-PGD; Algorithm 2 CNC-SGD |
| Open Source Code | No | The paper does not provide any specific links or explicit statements about the availability of its source code. |
| Open Datasets | Yes | All of these experiments are conducted using feed-forward networks on the well-known MNIST classification task (n = 70 000). |
| Dataset Splits | No | The paper mentions using the MNIST dataset but does not explicitly specify the training, validation, or test splits (e.g., percentages or exact counts) or reference a standard split with a citation for reproducibility. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed to replicate the experiments. |
| Experiment Setup | Yes | The parameters we use for the half-space problem are as follows: learning rate η = 0.05 for SGD, η = 0.005 for GD, r = 0.1 for perturbed methods. For the neural network experiments, we use a constant learning rate of 0.01 for SGD. |
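
The table quotes pseudocode names (Algorithm 1 CNC-PGD, Algorithm 2 CNC-SGD) and step sizes, but no runnable code is released. Below is a minimal sketch, in Python, of what a CNC-PGD-style loop could look like: gradient descent that substitutes a stochastic-gradient step for the usual isotropic perturbation once the full gradient becomes small. The toy least-squares objective, the gradient-norm threshold `eps`, and the iteration budget are assumptions introduced here; only the step sizes η = 0.005 and r = 0.1 come from the experiment setup quoted in the last row, and the switching rule is a reading of the pseudocode, not the authors' reference implementation.

```python
# Hedged sketch of a CNC-PGD-style update loop. This is NOT the authors' code
# (the report notes no source code is released). The toy least-squares objective,
# the gradient-norm threshold `eps`, and the iteration budget are illustrative
# assumptions; only eta = 0.005 and r = 0.1 are taken from the quoted setup.
import numpy as np

rng = np.random.default_rng(0)

def full_gradient(w, X, y):
    """Gradient of the toy objective 0.5 * ||X w - y||^2 / n."""
    return X.T @ (X @ w - y) / len(y)

def stochastic_gradient(w, X, y):
    """Gradient of the loss on a single uniformly sampled example."""
    i = rng.integers(len(y))
    return X[i] * (X[i] @ w - y[i])

def cnc_pgd(w, X, y, eta=0.005, r=0.1, eps=1e-3, steps=1000):
    """Gradient descent that falls back to a stochastic-gradient step (step size r)
    whenever the full gradient is small, using the stochastic gradient's correlation
    with negative-curvature directions as the perturbation."""
    for _ in range(steps):
        g = full_gradient(w, X, y)
        if np.linalg.norm(g) > eps:
            w = w - eta * g                           # ordinary gradient step
        else:
            w = w - r * stochastic_gradient(w, X, y)  # stochastic escape step
    return w

# Toy usage on random data.
X = rng.standard_normal((200, 10))
y = rng.standard_normal(200)
w_final = cnc_pgd(np.zeros(10), X, y)
```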
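
The Research Type row quotes the paper's experimental claim that stochastic gradients retain a significant component along the eigenvector of the most negative Hessian eigenvalue. The sketch below shows one way such a measurement can be set up; it uses a toy quadratic objective rather than the MNIST feed-forward networks from the paper, and the problem sizes and evaluation point are assumptions made purely for illustration.

```python
# Hedged sketch of the diagnostic described in the quoted abstract: estimate the
# mean-squared projection of per-example stochastic gradients onto the eigenvector
# of the most negative eigenvalue of the full Hessian. The toy quadratic objective,
# the problem sizes, and the evaluation point are illustrative assumptions; the
# paper performs this measurement on feed-forward networks trained on MNIST.
import numpy as np

rng = np.random.default_rng(0)
d, n = 20, 500

# Per-example quadratics f_i(w) = 0.5 * w^T H_i w with symmetric H_i; the full
# Hessian H = mean(H_i) generically has negative eigenvalues, so w = 0 is a saddle.
A = rng.standard_normal((n, d, d))
H_i = 0.5 * (A + A.transpose(0, 2, 1))
H = H_i.mean(axis=0)

# Eigenvector for the most negative eigenvalue (eigh returns ascending eigenvalues).
eigvals, eigvecs = np.linalg.eigh(H)
v_min = eigvecs[:, 0]

# Stochastic gradients near the saddle: grad f_i(w) = H_i @ w.
w = 1e-2 * rng.standard_normal(d)
stoch_grads = H_i @ w                    # shape (n, d)

# Mean-squared component along the negative-curvature direction, E[(g_i^T v_min)^2].
proj = stoch_grads @ v_min
print("mean-squared projection onto v_min:", np.mean(proj ** 2))
```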