Implicit Gradient Regularization
Authors: David Barrett, Benoit Dherin
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the properties of this regularization in deep neural networks such as MLPs trained to classify MNIST digits and ResNets trained to classify CIFAR-10 images, and in a tractable two-parameter model. In these cases, we verify that IGR effectively encourages models toward minima in the vicinity of small gradient values, in flatter regions with shallower slopes, and that these minima have low test error, consistent with previous observations. Furthermore, we demonstrate that the implicit gradient regularization term can be used as an explicit regularizer, allowing us to control this gradient regularization directly. Next, we empirically investigate implicit gradient regularization and explicit gradient regularization in deep neural networks. (A minimal sketch of this explicit variant is given after the table.) |
| Researcher Affiliation | Industry | David G.T. Barrett, DeepMind, London, barrettdavid@google.com; Benoit Dherin, Google, Dublin, dherin@google.com |
| Pseudocode | No | The paper presents mathematical derivations and descriptions of the methods but does not include any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions implementing models using Haiku and JAX, which are third-party libraries, but does not provide a link or statement about open-sourcing the code specific to their described methodology. |
| Open Datasets | Yes | We explore the properties of this regularization in deep neural networks such as MLPs trained to classify MNIST digits and ResNets trained to classify CIFAR-10 images, and in a tractable two-parameter model. |
| Dataset Splits | No | The paper mentions using standard datasets like MNIST and CIFAR-10 and reports test accuracy, but it does not explicitly provide details about training, validation, and test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, or memory) used for running the experiments. |
| Software Dependencies | No | All our models are implemented using Haiku (Hennigan et al., 2020). For all these experiments, we use JAX (Bradbury et al., 2018). (These mentions do not include specific version numbers for reproducibility.) |
| Experiment Setup | Yes | Specifically, we train 5-layer MLPs with n_l units per layer, where n_l ∈ {50, 100, 200, 400, 800, 1600} and h ∈ {0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005}, using ReLU activation functions and a cross-entropy loss (see Appendix A.5 for further details...). We used stochastic gradient descent for the training with a batch size of 512 for a range of learning rates l ∈ {0.005, 0.01, 0.05, 0.1, 0.2}. In our numerical experiments, we find that larger learning rates lead to minima with smaller L2 norm (Figure 1b), closer to the flatter region in the parameter plane, consistent with Predictions 2.1 and 2.2. (A Haiku/JAX sketch of this setup also follows the table.) |
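
The Research Type row quotes the paper's claim that the implicit gradient regularization term can be used as an explicit regularizer. Below is a minimal sketch of that idea in JAX: the regularized loss adds the squared norm of the loss gradient, L_EGR = L + mu * ||∇L||². The toy linear model, squared-error loss, and the coefficient `mu` are illustrative assumptions, not the authors' released code.

```python
import jax
import jax.numpy as jnp


def loss_fn(params, x, y):
    # Toy linear model with a squared-error loss (assumed for illustration).
    preds = x @ params["w"] + params["b"]
    return jnp.mean((preds - y) ** 2)


def egr_loss(params, x, y, mu=0.01):
    # Explicit analogue of the implicit regularizer described in the paper:
    # add mu * ||grad L||^2 to the original loss.
    grads = jax.grad(loss_fn)(params, x, y)
    sq_norm = sum(jnp.sum(g ** 2) for g in jax.tree_util.tree_leaves(grads))
    return loss_fn(params, x, y) + mu * sq_norm


# Gradient of the regularized loss, usable in place of the plain gradient
# in a standard SGD update (JAX supports this second-order differentiation).
egr_grad = jax.grad(egr_loss)
```

The coefficient `mu` plays the role of the explicit regularization strength that, per the quoted text, lets one control the gradient regularization directly rather than relying on the learning-rate-dependent implicit term.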
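
The Experiment Setup row reports 5-layer ReLU MLPs trained with a cross-entropy loss and SGD at batch size 512, implemented with Haiku and JAX. The sketch below assembles those quoted pieces; the use of optax for SGD, the layer width of 400, the learning rate of 0.1, and the MNIST-sized input shape are assumptions chosen from the quoted ranges, not the authors' exact configuration.

```python
import haiku as hk
import jax
import jax.numpy as jnp
import optax


def mlp_fn(x):
    # 5 hidden ReLU layers of equal width plus a 10-way output head.
    return hk.nets.MLP([400, 400, 400, 400, 400, 10],
                       activation=jax.nn.relu)(x)


model = hk.without_apply_rng(hk.transform(mlp_fn))


def loss_fn(params, x, y):
    logits = model.apply(params, x)
    labels = jax.nn.one_hot(y, 10)
    return jnp.mean(optax.softmax_cross_entropy(logits, labels))


optimizer = optax.sgd(learning_rate=0.1)  # one of the quoted learning rates

rng = jax.random.PRNGKey(0)
params = model.init(rng, jnp.zeros((512, 784)))  # batch size 512, MNIST-sized inputs
opt_state = optimizer.init(params)


@jax.jit
def train_step(params, opt_state, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state
```

Swapping `loss_fn` for an explicitly regularized loss like `egr_loss` above would turn this plain-SGD setup into the explicit gradient regularization variant discussed in the paper.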