On the Generalization Benefit of Noise in Stochastic Gradient Descent
Authors: Samuel Smith, Erich Elsen, Soham De
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. |
| Researcher Affiliation | Industry | DeepMind, London. Correspondence to: Samuel L. Smith <slsmith@google.com>, Soham De <sohamde@google.com>. |
| Pseudocode | No | The paper provides mathematical equations describing SGD updates but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper contains no explicit statement or link providing access to open-source code for the described methodology. |
| Open Datasets | Yes | For clarity, in the main text we only report experiments using Wide-ResNets on CIFAR-10 (Zagoruyko & Komodakis, 2016), however we provide additional experiments using ResNet-50 (He et al., 2016), LSTMs (Zaremba et al., 2014) and autoencoders (Sutskever et al., 2013) in the appendices. |
| Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits with exact percentages, sample counts, or citations to predefined validation splits. It describes tuning hyperparameters using 'test accuracy'. |
| Hardware Specification | No | No specific hardware details (e.g., exact GPU/CPU models, memory specifications) used for running the experiments are provided in the paper. |
| Software Dependencies | No | The paper does not provide specific software dependencies (e.g., programming language versions, library names with version numbers) required to replicate the experiments. |
| Experiment Setup | Yes | The momentum coefficient m = 0.9, the L2 regularization coefficient is 5 × 10⁻⁴, and when batch normalization is used we set the ghost batch size to 64 (Hoffer et al., 2017). We use the same learning rate schedule for all architectures. We hold the learning rate constant for the first N_epochs/2 epochs, where N_epochs denotes the number of training epochs. Then for the remainder of training, we reduce the learning rate by a factor of γ every N_epochs/20 epochs. In almost all of our experiments, we fix γ = 2... We always perform a grid search over learning rates for each batch size. |
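The learning-rate schedule quoted in the Experiment Setup row can be expressed compactly in code. The sketch below is a minimal plain-Python rendering of that description only; it assumes the first decay step occurs immediately after the constant phase, and the function name, argument names, and the example values for the base learning rate and epoch count are illustrative placeholders, not details taken from the paper.

```python
def scheduled_lr(epoch, base_lr, n_epochs, gamma=2.0):
    """Hold base_lr for the first n_epochs/2 epochs, then divide it by
    gamma every n_epochs/20 epochs for the remainder of training."""
    constant_phase = n_epochs / 2
    if epoch < constant_phase:
        return base_lr
    # Decay intervals entered since the constant phase ended; whether the
    # first drop lands exactly at n_epochs/2 is an assumption, not stated.
    decays = 1 + int((epoch - constant_phase) // (n_epochs / 20))
    return base_lr / (gamma ** decays)


if __name__ == "__main__":
    # Placeholder example: a 200-epoch run with base learning rate 0.1.
    for epoch in (0, 99, 100, 110, 199):
        print(epoch, scheduled_lr(epoch, base_lr=0.1, n_epochs=200))
```

With γ = 2 this yields ten halvings over the second half of training (a total reduction of 2¹⁰); the base learning rate itself would be chosen per batch size via the grid search mentioned in the table.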