Surrogate Gap Minimization Improves Sharpness-Aware Training
Authors: Juntang Zhuang, Boqing Gong, Liangzhe Yuan, Yin Cui, Hartwig Adam, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan, Ting Liu
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, GSAM consistently improves generalization (e.g., +3.2% over SAM and +5.4% over AdamW on ImageNet top-1 accuracy for ViT-B/32). |
| Researcher Affiliation | Collaboration | Juntang Zhuang1 (j.zhuang@yale.edu); Boqing Gong2, Liangzhe Yuan2, Yin Cui2, Hartwig Adam2 ({bgong, lzyuan, yincui, hadam}@google.com); Nicha C. Dvornek1, Sekhar Tatikonda1, James S. Duncan1 ({nicha.dvornek, sekhar.tatikonda, james.duncan}@yale.edu); Ting Liu2 (liuti@google.com). 1 Yale University, 2 Google Research |
| Pseudocode | Yes | Algorithm 1: GSAM Algorithm (see the sketch after this table). |
| Open Source Code | Yes | Code is released at https://sites.google.com/view/gsam-iclr22/home. |
| Open Datasets | Yes | We train on the ImageNet-1k (Deng et al., 2009) training set using the Inception-style (Szegedy et al., 2015) pre-processing without extra training data or strong augmentation. |
| Dataset Splits | No | The paper mentions training on 'ImageNet-1k' and evaluating on other ImageNet variants (v1, v2, Real), and discusses hyperparameter searching, but it does not explicitly describe a validation dataset split or its size/proportion. |
| Hardware Specification | No | The paper does not specify any particular hardware (e.g., CPU, GPU models, or TPUs) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'AdamW optimizer', 'SGD with momentum', and 'TensorFlow implementation', but does not provide specific version numbers for any software components or libraries. |
| Experiment Setup | Yes | For all models, we search for the best learning rate and weight decay for vanilla training, and then use the same values for the experiments with SAM and GSAM. For ResNets, we search for ρ from 0.01 to 0.05 with a step size of 0.01. For ViTs and Mixers, we search for ρ from 0.05 to 0.6 with a step size of 0.05. In GSAM, we search for α in {0.01, 0.02, 0.03} for ResNets and α in {0.1, 0.2, 0.3} for ViTs and Mixers. We summarize the best hyper-parameters for each model in Appendix B. (Example grids are sketched below the table.) |
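
The GSAM update in Algorithm 1 combines the SAM perturbed gradient with a correction along the component of the original gradient that is orthogonal to it, which shrinks the surrogate gap f_p - f. Below is a minimal single-step sketch written from the paper's description, in PyTorch rather than the authors' released TensorFlow code (linked above); `model`, `loss_fn`, `base_optimizer`, and the default `rho`/`alpha` values are illustrative placeholders.

```python
import torch

def gsam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05, alpha=0.1):
    """One GSAM update: SAM perturbation plus a surrogate-gap correction (sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # Gradient of the original loss f(w).
    loss = loss_fn(model(inputs), targets)
    grads = torch.autograd.grad(loss, params)

    # Move to the approximate worst-case point w + eps inside an L2 ball of radius rho.
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / grad_norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)

    # Gradient of the perturbed loss f_p(w) = f(w + eps).
    perturbed_loss = loss_fn(model(inputs), targets)
    perturbed_grads = torch.autograd.grad(perturbed_loss, params)

    # Restore the original weights.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)

    # Decompose the original gradient into parts parallel and orthogonal to the
    # perturbed gradient; the final direction descends on f_p while the
    # orthogonal part (scaled by alpha) decreases the surrogate gap f_p - f.
    dot = sum((g * gp).sum() for g, gp in zip(grads, perturbed_grads))
    gp_norm_sq = sum((gp ** 2).sum() for gp in perturbed_grads) + 1e-12
    with torch.no_grad():
        for p, g, gp in zip(params, grads, perturbed_grads):
            g_parallel = (dot / gp_norm_sq) * gp
            g_orth = g - g_parallel
            p.grad = gp - alpha * g_orth  # final GSAM update direction

    base_optimizer.step()
    base_optimizer.zero_grad()
    return loss.item()
```

Calling `gsam_step(...)` once per mini-batch plugs GSAM on top of any base optimizer (e.g., `torch.optim.SGD` or AdamW). The paper additionally decays ρ along the learning-rate schedule; that scheduling is omitted from this sketch.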
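
For concreteness, the search ranges quoted in the Experiment Setup row translate to grids like the following (variable names are illustrative, not taken from the released code):

```python
import numpy as np

# Perturbation-radius (rho) grids, per the quoted ranges and step sizes.
rho_grid_resnet = np.round(np.arange(0.01, 0.05 + 1e-9, 0.01), 2)  # 0.01, 0.02, ..., 0.05
rho_grid_vit    = np.round(np.arange(0.05, 0.60 + 1e-9, 0.05), 2)  # 0.05, 0.10, ..., 0.60 (ViTs and Mixers)

# Grids for GSAM's surrogate-gap coefficient alpha.
alpha_grid_resnet = [0.01, 0.02, 0.03]
alpha_grid_vit    = [0.1, 0.2, 0.3]
```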