Implicit Gradient Regularization
Authors: David Barrett, Benoit Dherin
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the properties of this regularization in deep neural networks such as MLPs trained to classify MNIST digits and ResNets trained to classify CIFAR-10 images, and in a tractable two-parameter model. In these cases, we verify that IGR effectively encourages models toward minima in the vicinity of small gradient values, in flatter regions with shallower slopes, and that these minima have low test error, consistent with previous observations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. Next, we empirically investigate implicit gradient regularization and explicit gradient regularization in deep neural networks. (A minimal sketch of this explicit variant is given after the table.) |
| Researcher Affiliation | Industry | David G.T. Barrett, DeepMind, London, barrettdavid@google.com; Benoit Dherin, Google, Dublin, dherin@google.com |
| Pseudocode | No | The paper presents mathematical derivations and descriptions of the methods but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions implementing models using Haiku and JAX, which are third-party libraries, but does not provide a link or statement about open-sourcing the code specific to their described methodology. |
| Open Datasets | Yes | We explore the properties of this regularization in deep neural networks such as MLPs trained to classify MNIST digits and ResNets trained to classify CIFAR-10 images, and in a tractable two-parameter model. |
| Dataset Splits | No | The paper mentions using standard datasets like MNIST and CIFAR-10 and reports test accuracy, but it does not explicitly provide details about training, validation, and test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | All our models are implemented using Haiku (Hennigan et al., 2020). For all these experiments, we use JAX (Bradbury et al., 2018). (These mentions do not include specific version numbers for reproducibility.) |
| Experiment Setup | Yes | Specifically, we train 5-layer MLPs with n_l units per layer, where n_l ∈ {50, 100, 200, 400, 800, 1600} and h ∈ {0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005}, using ReLU activation functions and a cross-entropy loss (see Appendix A.5 for further details...). We used stochastic gradient descent for the training with a batch size of 512 for a range of learning rates l ∈ {0.005, 0.01, 0.05, 0.1, 0.2}. In our numerical experiments, we find that larger learning rates lead to minima with smaller L2 norm (Figure 1b), closer to the flatter region in the parameter plane, consistent with Predictions 2.1 and 2.2. (A Haiku/JAX sketch of this setup also follows the table.) |
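
The Research Type row quotes the paper's claim that the implicit gradient regularization term can be used as an explicit regularizer. Below is a minimal sketch of that idea in JAX: the regularized loss adds the squared norm of the loss gradient, L_EGR = L + mu * ||∇L||². The toy linear model, squared-error loss, and the coefficient `mu` are illustrative assumptions, not the authors' released code.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Toy linear model with a squared-error loss (assumed for illustration).
    preds = x @ params["w"] + params["b"]
    return jnp.mean((preds - y) ** 2)


def egr_loss(params, x, y, mu=0.01):
    # Explicit analogue of the implicit regularizer described in the paper:
    # add mu * ||grad L||^2 to the original loss.
    grads = jax.grad(loss_fn)(params, x, y)
    sq_norm = sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))
    return loss_fn(params, x, y) + mu * sq_norm


# Gradient of the regularized loss, usable in place of the plain gradient
# in a standard SGD update (JAX supports this second-order differentiation).
egr_grad = jax.grad(egr_loss)
```

The coefficient `mu` plays the role of the explicit regularization strength that, per the quoted text, lets one control the gradient regularization directly rather than relying on the learning-rate-dependent implicit term.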
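
The Experiment Setup row reports 5-layer ReLU MLPs trained with a cross-entropy loss and SGD at batch size 512, implemented with Haiku and JAX. The sketch below assembles those quoted pieces; the use of optax for SGD, the layer width of 400, the learning rate of 0.1, and the MNIST-sized input shape are assumptions chosen from the quoted ranges, not the authors' exact configuration.

```python
import haiku as hk
import jax
import jax.numpy as jnp
import optax


def mlp_fn(x):
    # 5 hidden ReLU layers of equal width plus a 10-way output head.
    return hk.nets.MLP([400, 400, 400, 400, 400, 10],
                       activation=jax.nn.relu)(x)


model = hk.without_apply_rng(hk.transform(mlp_fn))


def loss_fn(params, x, y):
    logits = model.apply(params, x)
    labels = jax.nn.one_hot(y, 10)
    return jnp.mean(optax.softmax_cross_entropy(logits, labels))


optimizer = optax.sgd(learning_rate=0.1)  # one of the quoted learning rates

rng = jax.random.PRNGKey(0)
params = model.init(rng, jnp.zeros((512, 784)))  # batch size 512, MNIST-sized inputs
opt_state = optimizer.init(params)


@jax.jit
def train_step(params, opt_state, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```

Swapping `loss_fn` for an explicitly regularized loss like `egr_loss` above would turn this plain-SGD setup into the explicit gradient regularization variant discussed in the paper.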