Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias
Authors: Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The results of numerical experiments shown in Figure 2 confirm the superiority of finite-difference GR in typical experimental settings. We trained deep neural networks using an NVIDIA A100 GPU for this experiment. |
| Researcher Affiliation | Collaboration | Ryo Karakida¹, Tomoumi Takase¹, Tomohiro Hayase², Kazuki Osawa³; ¹Artificial Intelligence Research Center, AIST, Japan; ²Cluster Metaverse Lab, Japan; ³Department of Computer Science, ETH Zurich, Switzerland. |
| Pseudocode | Yes | The pseudo-code for F-GR is given in Algorithm 1. (A hedged PyTorch sketch of a finite-difference GR step is given after this table.) |
| Open Source Code | Yes | PyTorch code is available at https://github.com/ryokarakida/gradient_regularization. |
| Open Datasets | Yes | We trained MLP (width 512) and ResNet on CIFAR-10 by using SGD with GR. We trained Wide ResNet-28-10 (WRN-28-10) with γ = {0, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}. For F-GR and B-GR, we set ε = {0.01, 0.02, 0.05, 0.1, 0.2, 0.5}. We computed the average and standard deviation over 5 trials of different random initialization. We used crop and horizontal flip as data augmentation, cosine scheduling with an initial learning rate of 0.1, and set momentum 0.9, batch size 128, and weight decay 0.0001. Table S.2 reports the best average accuracy achieved over all the above combinations of hyper-parameters. R-GR achieves the highest test accuracy in all cases. Figure S.3 shows the test accuracy with γ = 0.1 for F/B-GR and the highest test accuracy of DB over all γ. It clarifies that F-GR achieves the highest accuracy for large ε and performs better than B-GR and DB. |
| Dataset Splits | No | The paper mentions training parameters and datasets but does not explicitly describe how data was split for validation, or if a separate validation set was used to tune hyperparameters. |
| Hardware Specification | Yes | We trained deep neural networks using an NVIDIA A100 GPU for this experiment. |
| Software Dependencies | No | All experiments were implemented by PyTorch. (No version number for PyTorch or any other software is provided.) |
| Experiment Setup | Yes | We used Rectified Linear Units (ReLUs) for activation functions, and set batch size 256, momentum 0.9, initial learning rate 0.01, and used a step decay of the learning rate (scaled by 5 at epochs 60, 120, 160), γ = ε = 0.05 for GR. (A hedged configuration sketch follows this table.) |
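
For orientation, the following is a minimal PyTorch sketch of a finite-difference gradient-regularization step in the spirit of F-GR. It assumes the regularized objective L(θ) + (γ/2)·‖∇L(θ)‖² and replaces the Hessian-vector product in its gradient with a forward finite difference along the normalized gradient direction. This is an illustration under those assumptions, not the authors' exact Algorithm 1; the names `fgr_step`, `model`, `loss_fn`, `gamma`, and `eps` are placeholders.

```python
import torch


def fgr_step(model, loss_fn, x, y, optimizer, gamma=0.05, eps=0.05):
    """One step on L(theta) + (gamma/2) * ||grad L(theta)||^2, with the
    Hessian-vector product approximated by a forward finite difference."""
    params = [p for p in model.parameters() if p.requires_grad]

    # First backward pass: gradient of the plain loss at the current parameters.
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)).item()
    scale = eps / (grad_norm + 1e-12)

    # Perturb the parameters along the normalized gradient direction.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(g, alpha=scale)

    # Second backward pass: gradient of the loss at the perturbed parameters.
    loss_pert = loss_fn(model(x), y)
    grads_pert = torch.autograd.grad(loss_pert, params)

    # Undo the perturbation and assemble the finite-difference GR gradient:
    # grad L + gamma * (||grad L|| / eps) * (grad L(theta + eps*v) - grad L(theta)).
    coeff = gamma * grad_norm / eps
    with torch.no_grad():
        for p, g, gp in zip(params, grads, grads_pert):
            p.sub_(g, alpha=scale)
            p.grad = g + coeff * (gp - g)

    optimizer.step()
    return loss.item()
```

The point of this formulation is that it needs only two ordinary backward passes per step, avoiding the second-order graph that double backprop (DB) requires.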
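And a sketch of the quoted CIFAR-10 training configuration around that step. Beyond the quoted hyper-parameters (batch size 256, SGD momentum 0.9, initial learning rate 0.01, step decay at epochs 60, 120, 160, γ = ε = 0.05), everything here is an assumption: a plain 512-wide ReLU MLP stands in for the paper's models, the step decay is taken to divide the learning rate by 5 (MultiStepLR gamma = 0.2), and the 200-epoch budget is a placeholder.

```python
import torch
import torchvision
from torchvision import transforms

# CIFAR-10 loader with batch size 256, as in the quoted setup.
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                 transform=transforms.ToTensor()),
    batch_size=256, shuffle=True)

# A 512-wide ReLU MLP as a stand-in for the paper's MLP experiments.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(32 * 32 * 3, 512),
    torch.nn.ReLU(),
    torch.nn.Linear(512, 10))

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Step decay at epochs 60, 120, 160; gamma=0.2 assumes "scaled by 5" means divided by 5.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160], gamma=0.2)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(200):  # total epoch count is a placeholder
    for x, y in train_loader:
        fgr_step(model, loss_fn, x, y, optimizer, gamma=0.05, eps=0.05)
    scheduler.step()
```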