Implicit Regularization of Sharpness-Aware Minimization for Scale-Invariant Problems
Authors: Bingcong Li, Liang Zhang, Niao He
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of BAR, numerical experiments are conducted on various deep learning tasks using language models (LMs). Our theoretical and empirical findings reveal that i) SAM promotes balancedness; and ii) the regularization on balancedness is data-responsive: outliers have stronger impact. |
| Researcher Affiliation | Academia | Department of Computer Science, ETH Zurich, Switzerland {bingcong.li, liang.zhang, niao.he}@inf.ethz.ch |
| Pseudocode | Yes | Algorithm 1 SAM (Foret et al., 2021) ... Algorithm 2 nBAR ... Algorithm 3 oBAR (a minimal sketch of the SAM step appears below the table) |
| Open Source Code | Yes | Code is available at https://github.com/BingcongLi/BAR. |
| Open Datasets | Yes | Our evaluations are carried out on commonly-used datasets in the literature. GLUE benchmark... MNLI (inference; Williams et al., 2018), SST-2 (sentiment analysis; Socher et al., 2013), MRPC (paraphrase detection; Dolan and Brockett, 2005), CoLA (linguistic acceptability; Warstadt et al., 2019), QNLI (inference; Rajpurkar et al., 2018), QQP (question answering), RTE (inference), and STS-B (textual similarity; Cer et al., 2017). These datasets are released under different permissive licenses. SuperGLUE benchmark... CB (inference; De Marneffe et al., 2019), ReCoRD (multiple-choice question answering; Zhang et al., 2018), COPA (question answering; Roemmele et al., 2011). These datasets are released under different permissive licenses. WebNLG Challenge... (Gardent et al., 2017). Additional datasets. We also use SQuAD (question answering; Rajpurkar et al., 2016) in our experiments, which is released under license CC BY-SA 4.0. Other datasets include TREC (topic classification; Voorhees and Tice, 2000) and SNLI (inference; Bowman et al., 2015). |
| Dataset Splits | Yes | We follow the settings in (Malladi et al., 2023), and choose the backbones as RoBERTa-large, a masked LM with 355M parameters, and OPT-1.3B, an autoregressive LM (Liu et al., 2019; Zhang et al., 2022). The training set contains k = 512 samples per class while the test set has 1000 samples. We randomly sample 1000 examples for training and another 1000 for testing. (A sketch of such a split appears below the table.) |
| Hardware Specification | Yes | All experiments are performed on a server with AMD EPYC 7742 CPUs and NVIDIA GeForce RTX 3090 GPUs each with 24 GiB memory. |
| Software Dependencies | No | The paper mentions 'AdamW' as the base optimizer, 'FP16' and 'FP32' training, and 'Hugging Face' for model checkpoints. However, it does not specify version numbers for Python, PyTorch, TensorFlow, or other key libraries. |
| Experiment Setup | Yes | AdamW is adopted as the base optimizer, and hyperparameters are tuned from those in Table 6. Our experiments are averaged over 3 random trials. ... Table 6: Hyperparameters used for few-shot learning with RoBERTa-large. ... Table 8: Hyperparameters used for few-shot learning with OPT-1.3B. ... Table 11: Hyperparameters used for GPT2. |
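For context on the pseudocode row, here is a minimal PyTorch-style sketch of one SAM step (Foret et al., 2021), the base algorithm the paper builds on. The function name `sam_step` and the default `rho=0.05` are illustrative assumptions, not taken from the authors' released code; the BAR variants (nBAR, oBAR) add balancedness-aware regularization on top of this two-pass loop and are not reproduced here.

```python
import torch

def sam_step(model, loss_fn, data, target, base_opt, rho=0.05):
    """One SAM update: perturb weights along the normalized gradient,
    then descend using the gradient taken at the perturbed point."""
    # First forward/backward pass: gradient at the current weights w.
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()

    # Ascent step: eps = rho * g / ||g||_2, with the norm taken over all parameters.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)           # move to the perturbed point w + eps
            eps.append(e)

    # Second forward/backward pass: gradient at the perturbed weights.
    base_opt.zero_grad()
    loss_fn(model(data), target).backward()

    # Restore the original weights, then step with the perturbed gradient.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    base_opt.step()
```

Called once per mini-batch in place of the usual `base_opt.step()` (here with AdamW as `base_opt`, matching the paper's setup); the two forward/backward passes reflect the roughly 2x per-step cost SAM is known for.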
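The split described in the "Dataset Splits" row can likewise be made concrete. The sketch below assumes a list of (text, label) pairs and mirrors the reported k = 512 samples per class with a 1000-example held-out test set; the function name `few_shot_split` and the seeding scheme are hypothetical, not from the paper.

```python
import random
from collections import defaultdict

def few_shot_split(examples, k=512, test_size=1000, seed=0):
    """Illustrative few-shot split: k examples per class for training,
    then a fixed-size test set drawn from the remainder."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    train, rest = [], []
    for items in by_label.values():
        rng.shuffle(items)
        train.extend(items[:k])   # k samples per class for training
        rest.extend(items[k:])    # remainder is the candidate test pool

    rng.shuffle(rest)
    return train, rest[:test_size]
```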