On the Origin of Implicit Regularization in Stochastic Gradient Descent
Authors: Samuel L. Smith, Benoit Dherin, David G. T. Barrett, Soham De
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify empirically that explicitly including the implicit regularizer in the loss can enhance the test accuracy when the learning rate is small. ... In Section 2.3, we confirm empirically that the implicit regularizer can enhance the test accuracy of deep networks. (A minimal sketch of this explicitly regularized loss appears after the table.) |
| Researcher Affiliation | Industry | Samuel L. Smith¹, Benoit Dherin², David G. T. Barrett¹ and Soham De¹; ¹DeepMind, ²Google. {slsmith, dherin, barrettdavid, sohamde}@google.com |
| Pseudocode | No | The paper does not contain any sections, figures, or blocks explicitly labeled as 'Pseudocode' or 'Algorithm', nor does it present any structured steps in a code-like format. |
| Open Source Code | No | The paper does not include any explicit statements about releasing source code for the methodology described, nor does it provide any links to a code repository. |
| Open Datasets | Yes | We train the same model with two different (explicit) loss functions. ... We use a 10-1 Wide-ResNet model (Zagoruyko & Komodakis, 2016) for classification on CIFAR-10. ... In this section we provide additional experiments on the Fashion-MNIST dataset (Xiao et al., 2017)... |
| Dataset Splits | No | The paper discusses 'training' and 'test' sets and 'test accuracy', but it does not specify any training/test/validation dataset splits, such as percentages, sample counts for each split, or references to predefined validation splits with citations. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware used for running the experiments, such as GPU models, CPU types, or memory specifications. It only refers to models like the '10-1 Wide-ResNet', which is a network architecture, not hardware. |
| Software Dependencies | No | The paper mentions using 'SGD without Momentum' and 'training without batch normalization', which are algorithmic choices, but does not provide specific software names with version numbers for any libraries, frameworks, or environments used in the experiments. |
| Experiment Setup | Yes | We train for 6400 epochs at batch size 32 without learning rate decay using SGD without Momentum. We use standard data augmentation including crops and random flips, and we use weight decay with L2 coefficient 5 x 10^-4. ... We use a batch size B = 16 unless otherwise specified, and we do not use weight decay. (These settings are transcribed as a configuration sketch after the table.) |
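
As a concrete illustration of the Research Type evidence above: the implicit regularizer identified in the paper is a penalty on the squared norm of the mini-batch gradients, scaled by the learning rate, and the authors verify empirically that adding it explicitly to the loss can improve test accuracy at small learning rates. The snippet below is a minimal JAX sketch of such an explicitly regularized mini-batch loss; it is not the authors' code, and `loss_fn`, `params`, `batch`, `eps` and `lr` are placeholder names.

```python
# Minimal JAX sketch (not the authors' released code): a mini-batch loss with
# the implicit regularizer added explicitly, C_k(w) + (eps/4) * ||grad C_k(w)||^2.
# `loss_fn(params, batch)` stands in for any differentiable scalar training loss.
import jax
import jax.numpy as jnp

def regularized_loss(params, batch, loss_fn, eps):
    """Mini-batch cost plus an explicit squared-gradient-norm penalty."""
    cost = loss_fn(params, batch)
    grads = jax.grad(loss_fn)(params, batch)          # gradient of the mini-batch loss
    sq_norm = sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))
    return cost + 0.25 * eps * sq_norm

# Training then differentiates through the modified loss (second-order autodiff):
#   grads = jax.grad(regularized_loss)(params, batch, loss_fn, eps)
#   params = jax.tree_util.tree_map(lambda w, g: w - lr * g, params, grads)
```

Averaging this per-batch penalty over the m mini-batches in an epoch recovers the (eps/4m) Σ_k ||∇Ĉ_k(w)||² form of the modified loss analysed in the paper.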
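
The Experiment Setup evidence can likewise be read as a compact hyperparameter record. The dictionaries below transcribe the quoted settings; the keys are illustrative, and this is not a configuration file released with the paper.

```python
# Transcription of the quoted settings (illustrative keys, not an official config).
cifar10_setup = dict(
    model="10-1 Wide-ResNet",            # Zagoruyko & Komodakis (2016)
    dataset="CIFAR-10",
    optimizer="SGD without momentum",
    epochs=6400,
    batch_size=32,
    lr_schedule="constant (no decay)",
    weight_decay=5e-4,                   # L2 coefficient
    augmentation=("random crop", "random flip"),
)

fashion_mnist_setup = dict(              # additional Fashion-MNIST experiments
    dataset="Fashion-MNIST",
    batch_size=16,                       # "unless otherwise specified"
    weight_decay=0.0,                    # no weight decay
)
```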