When Will Gradient Regularization Be Harmful?
Authors: Yang Zhao, Hao Zhang, Xiuyuan Hu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical and theoretical analyses suggest this is due to GR inducing instability and divergence in gradient statistics of adaptive optimizers at the initial training stage. Inspired by the warmup heuristic, we propose three GR warmup strategies, each relaxing the regularization effect to a certain extent during the warmup course to ensure the accurate and stable accumulation of gradients. With experiments on Vision Transformer family, we confirm the three GR warmup strategies can effectively circumvent these issues, thereby largely improving the model performance. |
| Researcher Affiliation | Academia | Department of Electronic Engineering, Tsinghua University. Correspondence to: Hao Zhang <haozhang@tsinghua.edu.cn>, Yang Zhao <zhao-yang@tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 Gradient Regularization Warmup Strategies |
| Open Source Code | Yes | Code is available at https://github.com/zhaoyang0204/gnp. |
| Open Datasets | Yes | We select to train ViT-Ti, ViT-S and ViT-B model architectures from scratch on CIFAR-{10, 100} dataset. [...] Meanwhile, we have also extended our evaluation to include Tiny ImageNet as well as ImageNet, as shown at Table 5. |
| Dataset Splits | No | The paper mentions using CIFAR-{10, 100}, Tiny ImageNet, and ImageNet datasets for training and reports test error rates, but it does not explicitly provide the specific training/validation/test dataset splits (e.g., percentages or sample counts) needed for reproduction. It only refers to 'test error rates' and 'final training results' without detailing the split methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or cloud instance types) used for running its experiments. |
| Software Dependencies | No | The paper mentions optimizers like Adam and RMSProp but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions, CUDA versions) needed to replicate the experiment. |
| Experiment Setup | Yes | To ensure optimal model performance, we will leverage established training recipes recommended in the contemporary works (Dosovitskiy et al., 2021; Foret et al., 2021; Zhao et al., 2022b; Karakida et al., 2023) and additionally conduct searches on essential hyperparameters. Details on the training process can be found in the Appendix. [...] Table 3. The basic hyperparameters for training ViTs. (Includes specific values for Epoch, LR/GR Warmup epoch, Batch size, Basic learning rate, Weight decay, etc.) |
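
The Research Type and Pseudocode rows above refer to the paper's gradient regularization (GR) warmup strategies, which relax the regularization effect early in training while the gradient statistics of adaptive optimizers stabilize. The sketch below is a minimal illustration of that general idea, not the authors' Algorithm 1 (their released code is at https://github.com/zhaoyang0204/gnp): it pairs a gradient-norm-penalty loss with a linear warmup of its coefficient. The function names, the linear schedule, and the value `lam_max=0.01` are illustrative assumptions.

```python
import torch


def gr_coefficient(step: int, warmup_steps: int, lam_max: float) -> float:
    """Linearly ramp the GR coefficient from 0 to lam_max over the warmup.

    This captures the "relax the regularization effect during the warmup
    course" idea; the paper's three strategies differ in how the relaxation
    is scheduled.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return lam_max
    return lam_max * step / warmup_steps


def gr_loss(model: torch.nn.Module, loss_fn, x, y, lam: float) -> torch.Tensor:
    """Task loss plus lam times the L2 norm of its parameter gradient.

    A common form of gradient regularization (gradient-norm penalty); the
    double backward needed for the penalty term is enabled by
    create_graph=True.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    task_loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(task_loss, params, create_graph=True)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    return task_loss + lam * grad_norm


# Sketch of use inside a training loop (names are placeholders):
#   lam = gr_coefficient(global_step, warmup_steps, lam_max=0.01)
#   loss = gr_loss(model, torch.nn.functional.cross_entropy, x, y, lam)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```

Under this sketch, the penalty coefficient is zero at step 0 and reaches its full value only after the warmup window, so the adaptive optimizer accumulates its gradient moments on (nearly) unregularized gradients first, which is the behavior the paper's warmup strategies are designed to preserve.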