On the Origin of Implicit Regularization in Stochastic Gradient Descent

Authors: Samuel L. Smith, Benoit Dherin, David G. T. Barrett, Soham De

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small. ... In Section 2.3, we confirm empirically that the implicit regularizer can enhance the test accuracy of deep networks.
Researcher Affiliation | Industry | Samuel L. Smith (1), Benoit Dherin (2), David G. T. Barrett (1) and Soham De (1); (1) DeepMind, (2) Google; {slsmith, dherin, barrettdavid, sohamde}@google.com
Pseudocode | No | The paper does not contain any sections, figures, or blocks explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present any structured steps in a code-like format.
Open Source Code | No | The paper does not include any explicit statements about releasing source code for the methodology described, nor does it provide any links to a code repository.
Open Datasets | Yes | We train the same model with two different (explicit) loss functions. ... We use a 10-1 Wide-ResNet model (Zagoruyko & Komodakis, 2016) for classification on CIFAR-10. ... In this section we provide additional experiments on the Fashion-MNIST dataset (Xiao et al., 2017)...
Dataset Splits | No | The paper discusses 'training' and 'test' sets and 'test accuracy', but it does not specify any training/test/validation dataset splits, such as percentages, sample counts for each split, or references to predefined validation splits with citations.
Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. It only refers to models like '10-1 Wide-ResNet' which are network architectures, not hardware.
Software Dependencies | No | The paper mentions using 'SGD without Momentum' and 'training without batch normalization', which are algorithmic choices, but does not provide specific software names with version numbers for any libraries, frameworks, or environments used in the experiments.
Experiment Setup | Yes | We train for 6400 epochs at batch size 32 without learning rate decay using SGD without Momentum. We use standard data augmentation including crops and random flips, and we use weight decay with L2 coefficient 5 x 10^-4. ... We use a batch size B = 16 unless otherwise specified, and we do not use weight decay. (A hedged training-step sketch based on this setup follows the table.)
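The Research Type and Experiment Setup rows above describe training with plain SGD (no momentum) while adding the implicit regularizer explicitly to the loss. Below is a minimal sketch of one such training step. It assumes, as a reading of the paper's analysis rather than a quotation of its code, that the term added for each minibatch is (epsilon / 4) times the squared norm of that minibatch's gradient, where epsilon is the learning rate. The names loss_fn, params, and batch, and the toy linear model, are hypothetical placeholders; the paper itself trains a 10-1 Wide-ResNet with cross-entropy on CIFAR-10.

```python
# Minimal sketch (assumption-laden, not the authors' code): one SGD step on a
# minibatch loss with SGD's implicit regularizer added back explicitly,
# assumed here to be (epsilon / 4) * ||grad of the minibatch loss||^2.
import jax
import jax.numpy as jnp


def loss_fn(params, batch):
    # Hypothetical stand-in loss (linear model, squared error); the paper
    # trains a 10-1 Wide-ResNet with cross-entropy on CIFAR-10.
    x, y = batch
    preds = x @ params["w"] + params["b"]
    return jnp.mean((preds - y) ** 2)


def regularized_loss(params, batch, epsilon):
    # Original minibatch loss plus an explicit copy of the assumed implicit
    # regularizer: (epsilon / 4) times the squared minibatch gradient norm.
    minibatch_loss, grads = jax.value_and_grad(loss_fn)(params, batch)
    grad_sq_norm = sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))
    return minibatch_loss + (epsilon / 4.0) * grad_sq_norm


@jax.jit
def sgd_step(params, batch, epsilon):
    # Plain SGD without momentum, matching the quoted experiment setup.
    grads = jax.grad(regularized_loss)(params, batch, epsilon)
    return jax.tree_util.tree_map(lambda p, g: p - epsilon * g, params, grads)


# Example usage on random data with batch size 32, as in the quoted setup.
key = jax.random.PRNGKey(0)
params = {"w": jax.random.normal(key, (10, 1)), "b": jnp.zeros((1,))}
batch = (jax.random.normal(key, (32, 10)), jax.random.normal(key, (32, 1)))
params = sgd_step(params, batch, epsilon=0.1)
```

Differentiating the regularized loss involves second-order derivatives (a Hessian-vector product through the inner gradient), which JAX composes automatically, so each such step is more expensive than a plain SGD step on the unmodified loss.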