On the Generalization Benefit of Noise in Stochastic Gradient Descent

Authors: Samuel Smith, Erich Elsen, Soham De

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set.
Researcher Affiliation | Industry | DeepMind, London. Correspondence to: Samuel L. Smith <slsmith@google.com>, Soham De <sohamde@google.com>.
Pseudocode | No | The paper provides mathematical equations describing SGD updates but does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | No explicit statement or link providing concrete access to open-source code for the described methodology is found in the paper.
Open Datasets | Yes | For clarity, in the main text we only report experiments using Wide-ResNets on CIFAR-10 (Zagoruyko & Komodakis, 2016), however we provide additional experiments using ResNet-50 (He et al., 2016), LSTMs (Zaremba et al., 2014) and autoencoders (Sutskever et al., 2013) in the appendices.
Dataset Splits | No | The paper does not provide specific train/validation/test dataset splits with exact percentages, sample counts, or citations to predefined validation splits; it describes tuning hyperparameters using 'test accuracy'.
Hardware Specification | No | The paper provides no specific hardware details (e.g., exact GPU/CPU models or memory specifications) used to run the experiments.
Software Dependencies | No | The paper does not list specific software dependencies (e.g., programming language versions or library names with version numbers) required to replicate the experiments.
Experiment Setup | Yes | The momentum coefficient m = 0.9, the L2 regularization coefficient is 5 × 10^-4, and when batch normalization is used we set the ghost batch size to 64 (Hoffer et al., 2017). We use the same learning rate schedule for all architectures. We hold the learning rate constant for the first N_epochs/2 epochs, where N_epochs denotes the number of training epochs. Then for the remainder of training, we reduce the learning rate by a factor of γ every N_epochs/20 epochs. In almost all of our experiments, we fix γ = 2... We always perform a grid search over learning rates for each batch size.
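
Since the paper releases no code, the following is only a minimal Python sketch of the setup quoted in the Experiment Setup row: the step learning-rate schedule and a heavy-ball SGD update with L2 regularization. All function and variable names are illustrative assumptions, and the quoted text does not pin down whether the first decay is applied at epoch N_epochs/2 or one interval later; this sketch assumes the former.

```python
# Minimal sketch (not the authors' code) of the schedule and update rule
# described in the "Experiment Setup" row. Names are illustrative assumptions.

def learning_rate(epoch, base_lr, n_epochs, gamma=2.0):
    """Learning rate at a given (0-indexed) epoch.

    Constant for the first n_epochs/2 epochs, then divided by gamma every
    n_epochs/20 epochs. Assumption: the first decay happens at the start of
    the second phase; the quoted text leaves this ambiguous.
    """
    hold = n_epochs // 2        # constant phase: first N_epochs/2 epochs
    interval = n_epochs // 20   # decay interval during the remaining epochs
    if epoch < hold:
        return base_lr
    n_decays = (epoch - hold) // interval + 1
    return base_lr / (gamma ** n_decays)


def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=5e-4):
    """One heavy-ball SGD update with an L2 penalty.

    Standard textbook form only; the paper's exact parameterization of
    momentum and L2 regularization may differ from this sketch.
    """
    g = grad + weight_decay * w   # L2 term folded into the minibatch gradient
    v = momentum * v - lr * g     # velocity update with momentum m = 0.9
    return w + v, v               # new parameters, new velocity
```

Under this interpretation, a 200-epoch run would hold the base learning rate for the first 100 epochs and then halve it every 10 epochs, ending roughly three orders of magnitude below the base rate; the grid search over base learning rates per batch size mentioned in the quote is left outside the sketch.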