Normalization Layers Are All That Sharpness-Aware Minimization Needs

Authors: Maximilian Mueller, Tiffany Vlaar, David Rolnick, Matthias Hein

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We showcase the effect of SAM-ON, i.e. only applying SAM to the BatchNorm parameters, for a WideResNet-28-10 (WRN-28) on CIFAR-100 in Figure 1. We observe that SAM-ON obtains higher accuracy than conventional SAM (SAM-all) for all SAM variants considered (more SAM variants are shown in Figure 6 in the Appendix). ... We report mean accuracy and standard deviation over 3 seeds for CIFAR-100 in Table 1.
Researcher Affiliation | Academia | Maximilian Müller, University of Tübingen and Tübingen AI Center, maximilian.mueller@wsii.uni-tuebingen.de; Tiffany Vlaar, McGill University and Mila Quebec AI Institute, tiffany.vlaar@mila.quebec; David Rolnick, McGill University and Mila Quebec AI Institute, drolnick@cs.mcgill.ca; Matthias Hein, University of Tübingen and Tübingen AI Center, matthias.hein@uni-tuebingen.de
Pseudocode | No | No structured pseudocode or algorithm blocks labeled 'Pseudocode' or 'Algorithm' were found.
Open Source Code | Yes | Code is provided at https://github.com/mueller-mp/SAM-ON.
Open Datasets | Yes | We showcase the effect of SAM-ON ... on CIFAR-100 in Figure 1.
Dataset Splits | No | No explicit statement of specific training/validation/test splits (e.g., percentages or sample counts for a validation set) was found. The paper primarily discusses training and test phases, for example 'We train models for 200 epochs', and reports 'Test Accuracy (%)'.
Hardware Specification | Yes | We train a ResNet-50 for 100 epochs on eight 2080-Ti GPUs with m = 64, leading to an overall batch-size of 512.
Software Dependencies | No | The paper mentions 'PyTorch' and refers to the 'timm training script [53]', but does not provide specific version numbers for these or other software dependencies, which would be necessary for full reproducibility.
Experiment Setup | Yes | For ResNets, we follow [37] and adopt a learning rate of 0.1, momentum of 0.9, weight decay of 0.0005 and use label smoothing with a factor of 0.1.
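
The method assessed above, SAM-ON, restricts the SAM perturbation to normalization-layer parameters while the base optimizer still updates all weights. Below is a minimal PyTorch sketch of that idea, assembled from the details quoted in this table (SGD with learning rate 0.1, momentum 0.9, weight decay 0.0005, label smoothing 0.1). It is a sketch only, not the authors' implementation (which is in the linked SAM-ON repository); the helper names, the set of normalization module types, and the rho value are assumptions made here for illustration.

```python
import torch
import torch.nn as nn
import torchvision


def split_norm_params(model):
    """Collect parameters belonging to normalization layers; return (norm, other)."""
    # Which normalization types to include is an assumption in this sketch.
    norm_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm, nn.GroupNorm)
    norm_params = []
    for module in model.modules():
        if isinstance(module, norm_types):
            norm_params.extend(module.parameters(recurse=False))
    norm_ids = {id(p) for p in norm_params}
    other_params = [p for p in model.parameters() if id(p) not in norm_ids]
    return norm_params, other_params


model = torchvision.models.resnet50()
norm_params, other_params = split_norm_params(model)

# Base SGD optimizer with the ResNet hyperparameters quoted in the table.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # label smoothing 0.1 as quoted

rho = 0.05  # SAM perturbation radius; illustrative value, not taken from the table


def sam_on_step(inputs, targets):
    # First pass: gradients at the current point.
    optimizer.zero_grad()
    criterion(model(inputs), targets).backward()

    # Ascent step: perturb ONLY normalization-layer parameters (the "ON" in SAM-ON).
    grads = [p.grad for p in norm_params if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    perturbations = []
    with torch.no_grad():
        for p in norm_params:
            e = rho * p.grad / grad_norm if p.grad is not None else None
            if e is not None:
                p.add_(e)
            perturbations.append(e)

    # Second pass: gradients at the perturbed point, used for the actual update.
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()

    # Undo the perturbation, then step the base optimizer on all parameters.
    with torch.no_grad():
        for p, e in zip(norm_params, perturbations):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    return loss.item()
```

Calling sam_on_step(images, labels) once per mini-batch performs one SAM-ON update; as with standard SAM, each update costs two forward-backward passes.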