Normalization Layers Are All That Sharpness-Aware Minimization Needs
Authors: Maximilian Mueller, Tiffany Vlaar, David Rolnick, Matthias Hein
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We showcase the effect of SAM-ON, i.e. only applying SAM to the BatchNorm parameters, for a Wide ResNet-28-10 (WRN-28) on CIFAR-100 in Figure 1. We observe that SAM-ON obtains higher accuracy than conventional SAM (SAM-all) for all SAM variants considered (more SAM variants are shown in Figure 6 in the Appendix). ... We report mean accuracy and standard deviation over 3 seeds for CIFAR-100 in Table 1. (A code sketch of this norm-parameter selection appears after the table.) |
| Researcher Affiliation | Academia | Maximilian Müller, University of Tübingen and Tübingen AI Center (maximilian.mueller@wsii.uni-tuebingen.de); Tiffany Vlaar, McGill University and Mila Quebec AI Institute (tiffany.vlaar@mila.quebec); David Rolnick, McGill University and Mila Quebec AI Institute (drolnick@cs.mcgill.ca); Matthias Hein, University of Tübingen and Tübingen AI Center (matthias.hein@uni-tuebingen.de) |
| Pseudocode | No | No structured pseudocode or algorithm blocks with labels like 'Pseudocode' or 'Algorithm' were found. |
| Open Source Code | Yes | Code is provided at https://github.com/mueller-mp/SAM-ON. |
| Open Datasets | Yes | We showcase the effect of SAM-ON...on CIFAR-100 in Figure 1. |
| Dataset Splits | No | No explicit statement providing specific training/validation/test dataset splits (e.g., percentages or sample counts for a validation set) was found. The paper primarily discusses training and test phases, e.g., 'We train models for 200 epochs', and reports 'Test Accuracy (%)'. |
| Hardware Specification | Yes | We train a ResNet-50 for 100 epochs on eight 2080-Ti GPUs with m = 64, leading to an overall batch-size of 512. (The batch-size arithmetic is spelled out after the table.) |
| Software Dependencies | No | The paper mentions 'PyTorch' and refers to the 'timm training script [53]' but does not provide specific version numbers for these or other software dependencies, which are necessary for full reproducibility. |
| Experiment Setup | Yes | For ResNets, we follow [37] and adopt a learning rate of 0.1, momentum of 0.9, weight decay of 0.0005 and use label smoothing with a factor of 0.1. (These values are mapped onto a configuration sketch after the table.) |
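The SAM-ON idea quoted in the Research Type row, applying the SAM perturbation only to the normalization-layer parameters while still updating all parameters, can be made concrete with a short PyTorch sketch. This is a minimal illustration written for this report, not the authors' released implementation (see https://github.com/mueller-mp/SAM-ON for that); the `SAMON` class, the `split_norm_params` helper, the toy model, and the choice of `rho=0.05` are assumptions for demonstration purposes.

```python
import torch
import torch.nn as nn

# Normalization layers whose affine parameters receive the SAM perturbation.
NORM_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d,
              nn.LayerNorm, nn.GroupNorm)


def split_norm_params(model: nn.Module):
    """Separate normalization-layer parameters from all remaining parameters."""
    norm_params = [p for m in model.modules() if isinstance(m, NORM_TYPES)
                   for p in m.parameters(recurse=False)]
    norm_ids = {id(p) for p in norm_params}
    other_params = [p for p in model.parameters() if id(p) not in norm_ids]
    return norm_params, other_params


class SAMON(torch.optim.Optimizer):
    """Two-step SAM wrapper that perturbs only param groups flagged perturb=True."""

    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        defaults = dict(rho=rho, perturb=True, **kwargs)
        super().__init__(params, defaults)
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)
        self.param_groups = self.base_optimizer.param_groups

    @torch.no_grad()
    def first_step(self):
        # Gradient norm over the perturbed (here: normalization) parameters only.
        grads = [p.grad.norm(2) for group in self.param_groups if group["perturb"]
                 for p in group["params"] if p.grad is not None]
        grad_norm = torch.stack(grads).norm(2)
        for group in self.param_groups:
            if not group["perturb"]:
                continue
            scale = group["rho"] / (grad_norm + 1e-12)
            for p in group["params"]:
                if p.grad is None:
                    continue
                e_w = p.grad * scale       # ascent step towards higher loss
                p.add_(e_w)
                self.state[p]["e_w"] = e_w

    @torch.no_grad()
    def second_step(self):
        # Undo the perturbation, then update *all* parameters with the
        # base optimizer, using gradients taken at the perturbed point.
        for group in self.param_groups:
            for p in group["params"]:
                e_w = self.state[p].pop("e_w", None)
                if e_w is not None:
                    p.sub_(e_w)
        self.base_optimizer.step()


# Toy model so the sketch runs end-to-end (stand-in for WRN-28-10 / ResNet).
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16),
                      nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 100))
criterion = nn.CrossEntropyLoss()
norm_params, other_params = split_norm_params(model)
optimizer = SAMON(
    [{"params": norm_params, "perturb": True},     # SAM applied here only
     {"params": other_params, "perturb": False}],  # plain SGD update only
    torch.optim.SGD, rho=0.05, lr=0.1, momentum=0.9, weight_decay=0.0005,
)

x, y = torch.randn(8, 3, 32, 32), torch.randint(0, 100, (8,))
criterion(model(x), y).backward()  # gradients at the current weights
optimizer.first_step()             # perturb normalization parameters only
optimizer.zero_grad()
criterion(model(x), y).backward()  # gradients at the perturbed weights
optimizer.second_step()            # restore perturbation, update all parameters
optimizer.zero_grad()
```

Note the design choice this illustrates: the ascent ("sharpness") step touches only the normalization parameters, but the descent step at the perturbed point still updates the full network, which is what distinguishes SAM-ON from simply freezing the other parameters.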
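The hardware row's quoted numbers determine the overall batch size directly: eight GPUs, each processing m = 64 samples per step. A one-line check (plain Python, written for this report):

```python
m, n_gpus = 64, 8                 # per-GPU batch size and number of 2080-Ti GPUs, as quoted
overall_batch_size = m * n_gpus   # data-parallel training: per-GPU batch times GPU count
assert overall_batch_size == 512  # matches the paper's stated overall batch size
```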
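Similarly, the experiment-setup row's hyperparameters translate into the following base-optimizer configuration. This is a minimal sketch assuming a torchvision ResNet-50 as a stand-in architecture; the cosine schedule over the 200 training epochs mentioned in the dataset-splits row is an assumption, since the table does not quote the paper's exact schedule.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Quoted values: learning rate 0.1, momentum 0.9, weight decay 0.0005,
# label smoothing 0.1 (this SGD serves as the base optimizer for the SAM variants).
model = resnet50(num_classes=100)  # stand-in; the paper also trains WRN-28-10 on CIFAR-100
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.0005)
# Assumption: cosine decay over the 200 epochs quoted in the dataset-splits row.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
```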