Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias

Authors: Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The results of numerical experiments shown in Figure 2 confirm the superiority of finite-difference GR in typical experimental settings. We trained deep neural networks using an NVIDIA A100 GPU for this experiment.
Researcher Affiliation | Collaboration | Ryo Karakida¹, Tomoumi Takase¹, Tomohiro Hayase², Kazuki Osawa³. ¹Artificial Intelligence Research Center, AIST, Japan; ²Cluster Metaverse Lab, Japan; ³Department of Computer Science, ETH Zurich, Switzerland.
Pseudocode | Yes | The pseudo-code for F-GR is given in Algorithm 1. (A hedged sketch of a finite-difference GR update is given below the table.)
Open Source Code | Yes | PyTorch code is available at https://github.com/ryokarakida/gradient_regularization.
Open Datasets | Yes | We trained an MLP (width 512) and a ResNet on CIFAR-10 using SGD with GR. We trained Wide ResNet-28-10 (WRN-28-10) with γ = {0, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. For F-GR and B-GR, we set ε = {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}. We computed the average and standard deviation over 5 trials with different random initializations. We used crop and horizontal flip as data augmentation, cosine scheduling with an initial learning rate of 0.1, and set momentum 0.9, batch size 128, and weight decay 0.0001. Table S.2 reports the best average accuracy achieved over all the above combinations of hyper-parameters; F-GR achieves the highest test accuracy in all cases. Figure S.3 shows the test accuracy with γ = 0.1 for F/B-GR and the highest test accuracy of DB over all γ, clarifying that F-GR achieves the highest accuracy for large ε and performs better than B-GR and DB. (A configuration sketch for this setup follows the table.)
Dataset Splits | No | The paper mentions training parameters and datasets but does not explicitly describe how the data was split for validation, or whether a separate validation set was used to tune hyper-parameters.
Hardware Specification | Yes | We trained deep neural networks using an NVIDIA A100 GPU for this experiment.
Software Dependencies | No | All experiments were implemented in PyTorch. (No version number for PyTorch or any other software is provided.)
Experiment Setup | Yes | We used Rectified Linear Units (ReLUs) as activation functions, set batch size 256, momentum 0.9, and initial learning rate 0.01, used a step decay of the learning rate (decayed by a factor of 5 at epochs 60, 120, 160), and set γ = ε = 0.05 for GR. (A minimal configuration sketch using these values follows the table.)
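
The Pseudocode row points to Algorithm 1 (F-GR) without reproducing it. As a rough illustration only, the following PyTorch sketch approximates the gradient of the regularizer (γ/2)‖∇L(θ)‖², i.e. γ∇²L(θ)∇L(θ), with a forward finite difference of the loss gradient taken along the normalized gradient direction. The function name fgr_step, the normalization by ‖∇L‖, and the plain-SGD update are assumptions made here for illustration; the exact procedure is Algorithm 1 of the paper and the released repository.

import torch


def fgr_step(model, loss_fn, x, y, lr=0.01, gamma=0.05, eps=0.05):
    """One SGD step with a finite-difference gradient-regularization term (sketch).

    Approximates the gradient of (gamma/2) * ||g||^2, with g = grad L(theta), by
    gamma * ||g|| * (grad L(theta + eps * g / ||g||) - g) / eps.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient g = grad L(theta) at the current parameters.
    loss = loss_fn(model(x), y)
    g = torch.autograd.grad(loss, params)
    g_norm = torch.sqrt(sum((gi ** 2).sum() for gi in g)) + 1e-12

    with torch.no_grad():
        # Perturb theta by eps along the normalized gradient direction.
        for p, gi in zip(params, g):
            p.add_(eps * gi / g_norm)

    # Gradient at the perturbed parameters.
    g_pert = torch.autograd.grad(loss_fn(model(x), y), params)

    with torch.no_grad():
        for p, gi, gp in zip(params, g, g_pert):
            p.sub_(eps * gi / g_norm)             # restore theta
            reg_grad = g_norm * (gp - gi) / eps   # finite-difference Hessian-gradient product
            p.sub_(lr * (gi + gamma * reg_grad))  # descend on L + (gamma/2) * ||grad L||^2
    return float(loss)

Compared with double backpropagation (DB), which differentiates through the gradient norm directly, a finite-difference scheme of this kind needs only two ordinary gradient evaluations per step; momentum and weight decay, which the quoted settings also use, are left out of the sketch for brevity.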
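
The Open Datasets row quotes the CIFAR-10 sweep for Wide ResNet-28-10 (crop/flip augmentation, cosine learning-rate schedule, momentum, batch size, weight decay, and the γ/ε grids). The sketch below shows one plausible PyTorch rendering of that configuration; the 200-epoch budget and the use of torchvision's resnet18 as a stand-in for WRN-28-10 (which torchvision does not ship) are assumptions.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

EPOCHS = 200  # assumption; the quoted text does not state the epoch budget

# Crop and horizontal-flip augmentation, as quoted.
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=train_tf)
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

# Hyper-parameter grids quoted in the row (each combination averaged over 5 seeds).
gammas = [0.0, 1e-4, 1e-3, 1e-2, 1e-1]
epsilons = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]

# Stand-in architecture; the paper uses WRN-28-10.
model = models.resnet18(num_classes=10)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

In an F-GR or B-GR run, the γ and ε values from these grids would feed the regularized gradient computation; how exactly that is combined with momentum and weight decay is specified in the paper's repository.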
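
The Experiment Setup row describes the MLP/ResNet runs with a step learning-rate decay and γ = ε = 0.05. A minimal way to wire those numbers to the fgr_step sketch above is shown below; the two-hidden-layer, width-512 MLP, the 200-epoch budget, and reading the step decay as a factor-of-5 reduction are assumptions, and the quoted momentum of 0.9 is omitted because the sketch applies plain SGD.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

BATCH_SIZE, INITIAL_LR = 256, 0.01
GAMMA = EPS = 0.05  # gamma = epsilon = 0.05, as quoted

# Width-512 ReLU MLP for 3x32x32 CIFAR-10 inputs (depth is an assumption).
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 10),
)
loss_fn = nn.CrossEntropyLoss()

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

lr = INITIAL_LR
for epoch in range(200):  # epoch budget is an assumption
    if epoch in (60, 120, 160):
        lr /= 5.0  # step decay, reading the quoted schedule as a factor-of-5 reduction
    for x, y in train_loader:
        fgr_step(model, loss_fn, x, y, lr=lr, gamma=GAMMA, eps=EPS)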