Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems

Authors: Bingcong Li, Liang Zhang, Niao He

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To demonstrate the effectiveness of BAR, numerical experiments are conducted on various deep learning tasks using language models (LMs). Our theoretical and empirical findings reveal that i) SAM promotes balancedness; and ii) the regularization on balancedness is data-responsive: outliers have a stronger impact.
Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Switzerland. {bingcong.li, liang.zhang, niao.he}@inf.ethz.ch
Pseudocode | Yes | Algorithm 1 SAM (Foret et al., 2021) ... Algorithm 2 nBAR ... Algorithm 3 oBAR. (A minimal sketch of the SAM update appears after this table.)
Open Source Code | Yes | Code is available at https://github.com/BingcongLi/BAR.
Open Datasets | Yes | Our evaluations are carried out on commonly-used datasets in the literature. GLUE benchmark... MNLI (inference (Williams et al., 2018)), SST-2 (sentiment analysis (Socher et al., 2013)), MRPC (paraphrase detection (Dolan and Brockett, 2005)), CoLA (linguistic acceptability (Warstadt et al., 2019)), QNLI (inference (Rajpurkar et al., 2018)), QQP (question answering), RTE (inference), and STS-B (textual similarity (Cer et al., 2017)). These datasets are released under different permissive licenses. SuperGLUE benchmark... CB (inference (De Marneffe et al., 2019)), ReCoRD (multiple-choice question answering (Zhang et al., 2018)), COPA (question answering (Roemmele et al., 2011)). These datasets are released under different permissive licenses. WebNLG Challenge... (Gardent et al., 2017). Additional datasets. We also use SQuAD (question answering (Rajpurkar et al., 2016)) in our experiments, which is released under license CC BY-SA 4.0. Other datasets include TREC (topic classification (Voorhees and Tice, 2000)) and SNLI (inference (Bowman et al., 2015)).
Dataset Splits | Yes | We follow the settings in (Malladi et al., 2023), and choose the backbones as RoBERTa-large, a masked LM with 355M parameters, and OPT-1.3B, an autoregressive LM (Liu et al., 2019; Zhang et al., 2022). The training set contains k = 512 samples per class, while the test set has 1000 samples. We randomly sample 1000 examples for training and another 1000 for testing. (A sketch of this k-shot split appears after this table.)
Hardware Specification | Yes | All experiments are performed on a server with AMD EPYC 7742 CPUs and NVIDIA GeForce RTX 3090 GPUs, each with 24GiB memory.
Software Dependencies | No | The paper mentions 'AdamW' as the base optimizer, 'FP16' and 'FP32' training, and 'Hugging Face' for model checkpoints. However, it does not specify version numbers for Python, PyTorch, TensorFlow, or other key libraries.
Experiment Setup | Yes | AdamW is adopted as the base optimizer, and hyperparameters are tuned from those in Table 6. Our experiments are averaged over 3 random trials. ... Table 6: Hyperparameters used for few-shot learning with RoBERTa-large. ... Table 8: Hyperparameters used for few-shot learning with OPT-1.3B. ... Table 11: Hyperparameters used for GPT2. (An illustrative AdamW wiring appears after this table.)
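
The Pseudocode row above cites Algorithm 1, SAM (Foret et al., 2021). For readers who want the mechanics without opening the paper, here is a minimal PyTorch sketch of one SAM step: an ascent perturbation of size rho along the normalized gradient, a gradient evaluation at the perturbed weights, and a base-optimizer update. The function name and the default rho = 0.05 are illustrative choices, not the authors' BAR implementation (which lives in the linked repository).

    import torch

    def sam_step(model, loss_fn, batch, base_optimizer, rho=0.05):
        # Ascent step: compute the gradient at the current weights and
        # perturb each parameter by rho * g / ||g|| (global L2 norm).
        loss_fn(model, batch).backward()
        params = [p for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([p.grad.norm() for p in params]))
        perturbations = []
        with torch.no_grad():
            for p in params:
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)
                perturbations.append((p, e))
        model.zero_grad()
        # Descent step: gradient evaluated at the perturbed point.
        loss_fn(model, batch).backward()
        # Undo the perturbation, then update the original weights with
        # the base optimizer using the perturbed-point gradient.
        with torch.no_grad():
            for p, e in perturbations:
                p.sub_(e)
        base_optimizer.step()
        base_optimizer.zero_grad()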
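
The Dataset Splits row describes a k-shot setup with k = 512 training samples per class and 1000 test samples, following Malladi et al. (2023). Below is a hypothetical sketch of such a split; the function name, seeding, and label field are assumptions, and the authors' exact sampling code may differ.

    import random
    from collections import defaultdict

    def few_shot_split(examples, k=512, test_size=1000, seed=0):
        # Group examples by label, keep k per class for training, and
        # draw the test set from the remaining pool.
        rng = random.Random(seed)
        by_label = defaultdict(list)
        for ex in examples:
            by_label[ex["label"]].append(ex)
        train, pool = [], []
        for items in by_label.values():
            rng.shuffle(items)
            train.extend(items[:k])
            pool.extend(items[k:])
        rng.shuffle(pool)
        return train, pool[:test_size]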
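
The Experiment Setup row names AdamW as the base optimizer, with hyperparameters tuned per Tables 6, 8, and 11. The loop below shows one plausible wiring of AdamW with the sam_step sketch above; the learning rate, weight decay, and rho here are placeholders, not the paper's tuned values.

    import torch

    def train_epoch(model, loader, loss_fn, lr=1e-5, weight_decay=0.01, rho=0.05):
        # AdamW as the base optimizer, wrapped by the SAM step sketched above.
        base = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
        for batch in loader:
            sam_step(model, loss_fn, batch, base, rho=rho)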