When Will Gradient Regularization Be Harmful?
Authors: Yang Zhao, Hao Zhang, Xiuyuan Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical and theoretical analyses suggest this is due to GR inducing instability and divergence in gradient statistics of adaptive optimizers at the initial training stage. Inspired by the warmup heuristic, we propose three GR warmup strategies, each relaxing the regularization effect to a certain extent during the warmup course to ensure the accurate and stable accumulation of gradients. With experiments on Vision Transformer family, we confirm the three GR warmup strategies can effectively circumvent these issues, thereby largely improving the model performance. |
| Researcher Affiliation | Academia | Department of Electronic Engineering, Tsinghua University. Correspondence to: Hao Zhang <haozhang@tsinghua.edu.cn>, Yang Zhao <zhao-yang@tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 Gradient Regularization Warmup Strategies |
| Open Source Code | Yes | Code is available at https://github.com/zhaoyang0204/gnp. |
| Open Datasets | Yes | We select to train ViT-Ti, ViT-S and ViT-B model architectures from scratch on CIFAR-{10, 100} dataset. [...] Meanwhile, we have also extended our evaluation to include Tiny ImageNet as well as ImageNet, as shown at Table 5. |
| Dataset Splits | No | The paper mentions using CIFAR-{10, 100}, Tiny ImageNet, and ImageNet datasets for training and reports test error rates, but it does not explicitly provide the specific training/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. It only refers to 'test error rates' and 'final training results' without detailing the split methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam and RMSProp but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, CUDA versions) needed to replicate the experiment. |
| Experiment Setup | Yes | To ensure optimal model performance, we will leverage established training recipes recommended in the contemporary works (Dosovitskiy et al., 2021; Foret et al., 2021; Zhao et al., 2022b; Karakida et al., 2023) and additionally conduct searches on essential hyperparameters. Details on the training process can be found in the Appendix. [...] Table 3. The basic hyperparameters for training ViTs. (Includes specific values for Epoch, LR/GR Warmup epoch, Batch size, Basic learning rate, Weight decay, etc.) |
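
The Research Type and Pseudocode rows above refer to the paper's gradient regularization (GR) warmup strategies, which relax the regularization effect early in training while the gradient statistics of adaptive optimizers stabilize. The sketch below is a minimal illustration of that general idea, not the authors' Algorithm 1 (their released code is at https://github.com/zhaoyang0204/gnp): it pairs a gradient-norm-penalty loss with a linear warmup of its coefficient. The function names, the linear schedule, and the value `lam_max=0.01` are illustrative assumptions.

```python
import torch


def gr_coefficient(step: int, warmup_steps: int, lam_max: float) -> float:
    """Linearly ramp the GR coefficient from 0 to lam_max over the warmup.

    This captures the "relax the regularization effect during the warmup
    course" idea; the paper's three strategies differ in how the relaxation
    is scheduled.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return lam_max
    return lam_max * step / warmup_steps


def gr_loss(model: torch.nn.Module, loss_fn, x, y, lam: float) -> torch.Tensor:
    """Task loss plus lam times the L2 norm of its parameter gradient.

    A common form of gradient regularization (gradient-norm penalty); the
    double backward needed for the penalty term is enabled by
    create_graph=True.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    task_loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    return task_loss + lam * grad_norm


# Sketch of use inside a training loop (names are placeholders):
#   lam = gr_coefficient(global_step, warmup_steps, lam_max=0.01)
#   loss = gr_loss(model, torch.nn.functional.cross_entropy, x, y, lam)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Under this sketch, the penalty coefficient is zero at step 0 and reaches its full value only after the warmup window, so the adaptive optimizer accumulates its gradient moments on (nearly) unregularized gradients first, which is the behavior the paper's warmup strategies are designed to preserve.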