On the Generalization Benefit of Noise in Stochastic Gradient Descent

Authors: Samuel Smith, Erich Elsen, Soham De

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set.
Researcher Affiliation | Industry | DeepMind, London. Correspondence to: Samuel L. Smith <slsmith@google.com>, Soham De <sohamde@google.com>.
Pseudocode | No | The paper provides mathematical equations describing SGD updates but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | No explicit statement or link providing concrete access to open-source code for the described methodology is found in the paper.
Open Datasets | Yes | For clarity, in the main text we only report experiments using Wide-ResNets on CIFAR-10 (Zagoruyko & Komodakis, 2016), however we provide additional experiments using ResNet-50 (He et al., 2016), LSTMs (Zaremba et al., 2014) and autoencoders (Sutskever et al., 2013) in the appendices.
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits with exact percentages, sample counts, or citations to predefined validation splits; it describes tuning hyperparameters using 'test accuracy'.
Hardware Specification | No | The paper provides no specific hardware details (e.g., exact GPU/CPU models or memory specifications) used to run the experiments.
Software Dependencies | No | The paper does not list specific software dependencies (e.g., programming language versions or library names with version numbers) required to replicate the experiments.
Experiment Setup | Yes | The momentum coefficient m = 0.9, the L2 regularization coefficient is 5 × 10^-4, and when batch normalization is used we set the ghost batch size to 64 (Hoffer et al., 2017). We use the same learning rate schedule for all architectures. We hold the learning rate constant for the first N_epochs/2 epochs, where N_epochs denotes the number of training epochs. Then for the remainder of training, we reduce the learning rate by a factor of γ every N_epochs/20 epochs. In almost all of our experiments, we fix γ = 2... We always perform a grid search over learning rates for each batch size.
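
Since the paper releases no code, the following is only a minimal Python sketch of the setup quoted in the Experiment Setup row: the step learning-rate schedule and a heavy-ball SGD update with L2 regularization. All function and variable names are illustrative assumptions, and the quoted text does not pin down whether the first decay is applied at epoch N_epochs/2 or one interval later; this sketch assumes the former.

```python
# Minimal sketch (not the authors' code) of the schedule and update rule
# described in the "Experiment Setup" row. Names are illustrative assumptions.

def learning_rate(epoch, base_lr, n_epochs, gamma=2.0):
    """Learning rate at a given (0-indexed) epoch.

    Constant for the first n_epochs/2 epochs, then divided by gamma every
    n_epochs/20 epochs. Assumption: the first decay happens at the start of
    the second phase; the quoted text leaves this ambiguous.
    """
    hold = n_epochs // 2        # constant phase: first N_epochs/2 epochs
    interval = n_epochs // 20   # decay interval during the remaining epochs
    if epoch < hold:
        return base_lr
    n_decays = (epoch - hold) // interval + 1
    return base_lr / (gamma ** n_decays)


def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=5e-4):
    """One heavy-ball SGD update with an L2 penalty.

    Standard textbook form only; the paper's exact parameterization of
    momentum and L2 regularization may differ from this sketch.
    """
    g = grad + weight_decay * w   # L2 term folded into the minibatch gradient
    v = momentum * v - lr * g     # velocity update with momentum m = 0.9
    return w + v, v               # new parameters, new velocity
```

Under this interpretation, a 200-epoch run would hold the base learning rate for the first 100 epochs and then halve it every 10 epochs, ending roughly three orders of magnitude below the base rate; the grid search over base learning rates per batch size mentioned in the quote is left outside the sketch.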