Sharpness-aware Minimization for Efficiently Improving Generalization
Authors: Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Additionally, we find that SAM natively provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels. ... In order to assess SAM's efficacy, we apply it to a range of different tasks, including image classification from scratch (including on CIFAR-10, CIFAR-100, and ImageNet), finetuning pretrained models, and learning with noisy labels. |
| Researcher Affiliation | Industry | Pierre Foret Google Research pierreforet@google.com Ariel Kleiner Google Research akleiner@google.com Hossein Mobahi Google Research hmobahi@google.com Behnam Neyshabur Blueshift, Alphabet neyshabur@google.com |
| Pseudocode | Yes | Algorithm 1 gives pseudo-code for the full SAM algorithm, using SGD as the base optimizer, and Figure 2 schematically illustrates a single SAM parameter update. Algorithm 1: SAM algorithm. (A minimal JAX sketch of one SAM update step appears after this table.) |
| Open Source Code | Yes | We open source our code at https://github.com/google-research/sam. |
| Open Datasets | Yes | We present empirical results showing that SAM improves model generalization across a variety of benchmark datasets (e.g., CIFAR-{10, 100}, ImageNet, finetuning tasks) and models, yielding novel state-of-the-art performance for several. Beyond CIFAR-{10, 100}, we have also evaluated SAM on the SVHN (Netzer et al., 2011) and Fashion-MNIST datasets (Xiao et al., 2017). |
| Dataset Splits | Yes | SAM has a single hyperparameter ρ (the neighborhood size), which we tune via a grid search over {0.01, 0.02, 0.05, 0.1, 0.2, 0.5} using 10% of the training set as a validation set. We report the validation accuracy of the bootstrapped version of SAM for different levels of noise and different ρ in Table 8. (A sketch of this split-and-search procedure appears after this table.) |
| Hardware Specification | Yes | Our implementations utilize JAX (Bradbury et al., 2018), and we train all models on a single host having 8 Nvidia V100 GPUs. We train all models on ImageNet for up to 400 epochs using a Google Cloud TPUv3 and report top-1 and top-5 test error rates for each experimental condition (mean and 95% confidence interval across 5 independent runs). |
| Software Dependencies | No | This approximation to ∇_w L_S^SAM(w) can be straightforwardly computed via automatic differentiation, as implemented in common libraries such as JAX, TensorFlow, and PyTorch. While these software components are mentioned, specific version numbers are not provided. |
| Experiment Setup | Yes | All results use basic data augmentations (horizontal flip, padding by four pixels, and random crop). We also evaluate in the setting of more advanced data augmentation methods such as cutout regularization (Devries & Taylor, 2017) and AutoAugment (Cubuk et al., 2018)... We train all models on ImageNet for up to 400 epochs using a Google Cloud TPUv3 and report top-1 and top-5 test error rates for each experimental condition (mean and 95% confidence interval across 5 independent runs). Table 6: Hyper-parameters used to produce the CIFAR-{10,100} results (lists specific LR, WD, ρ values). (A sketch of the basic augmentations appears after this table.) |
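
As a companion to the pseudocode row above, here is a minimal JAX sketch of one SAM update step with SGD as the base optimizer, following Algorithm 1 and the paper's first-order approximation ε̂(w) = ρ ∇L(w)/‖∇L(w)‖₂. The names `loss_fn`, `params`, `rho`, and `lr` are placeholders for the reader's own model, not the authors' released code (which lives at the repository linked above).

```python
import jax
import jax.numpy as jnp

def sam_sgd_step(params, batch, loss_fn, rho=0.05, lr=0.1):
    """One SAM update with SGD as the base optimizer (Algorithm 1 sketch)."""
    # Gradient of the training loss at the current weights w.
    grads = jax.grad(loss_fn)(params, batch)

    # First-order worst-case perturbation: eps(w) = rho * g / ||g||_2.
    grad_norm = jnp.sqrt(sum(jnp.sum(g ** 2)
                             for g in jax.tree_util.tree_leaves(grads)))
    eps = jax.tree_util.tree_map(lambda g: rho * g / (grad_norm + 1e-12), grads)

    # Gradient at the perturbed point w + eps(w) approximates grad L_S^SAM(w);
    # this is the "computed via automatic differentiation" step quoted above.
    sam_grads = jax.grad(loss_fn)(
        jax.tree_util.tree_map(lambda p, e: p + e, params, eps), batch)

    # Plain SGD descent step using the SAM gradient.
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, sam_grads)
```

Note the two gradient evaluations per step: this is why the paper reports SAM's per-update cost as roughly twice that of the base optimizer.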
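The dataset-splits row quotes a grid search over ρ with a 10% validation holdout. A hypothetical sketch of that procedure, assuming a user-supplied `train_and_evaluate(train_idx, val_idx, rho)` that returns validation accuracy:

```python
import numpy as np

RHO_GRID = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5]  # grid quoted from the paper

def tune_rho(num_examples, train_and_evaluate, seed=0):
    """Pick rho by validation accuracy on a held-out 10% of the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_examples)
    n_val = num_examples // 10                     # 10% validation split
    val_idx, train_idx = idx[:n_val], idx[n_val:]

    # Retrain once per candidate rho and keep the best-scoring value.
    scores = {rho: train_and_evaluate(train_idx, val_idx, rho)
              for rho in RHO_GRID}
    return max(scores, key=scores.get)
```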
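The experiment-setup row names the basic augmentations: horizontal flip, padding by four pixels, and random crop. A minimal JAX sketch for a single HWC image; the function name and RNG handling are assumptions, not the paper's actual input pipeline:

```python
import jax
import jax.numpy as jnp

def basic_augment(rng, image):
    """Random horizontal flip, pad by 4 pixels, then random crop back to HWC."""
    flip_rng, crop_rng = jax.random.split(rng)
    h, w, c = image.shape

    # Horizontal flip with probability 0.5.
    image = jnp.where(jax.random.bernoulli(flip_rng), image[:, ::-1, :], image)

    # Pad four pixels on each side, then take a random h x w crop.
    padded = jnp.pad(image, ((4, 4), (4, 4), (0, 0)))
    offsets = jax.random.randint(crop_rng, (2,), 0, 9)  # offsets in 0..8
    return jax.lax.dynamic_slice(padded, (offsets[0], offsets[1], 0), (h, w, c))
```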