A Universal Class of Sharpness-Aware Minimization Algorithms

Authors: Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In the experiments, we explore extensively two specific choices of these algorithms: (1) Frob-SAM, an algorithm biased toward minimizing the Frobenius norm of the Hessian, a meaningful sharpness notion for non-convex optimization problems, and (2) Det-SAM, an algorithm biased toward minimizing the determinant of the Hessian, a scale-invariant sharpness measure. We demonstrate the advantages of these two cases through an extensive series of experiments. Our code is available at https://github.com/dbahri/universal_sam. We evaluate on three vision datasets: CIFAR10, CIFAR100, and SVHN. Furthermore, we study how our explicit bias may be helpful in settings that generally benefit from regularization; specifically, when training data is limited and when training labels are noisy. For the former, we artificially subsample each original dataset, keeping only the first 10% of training samples, and we denote these sub-sampled datasets with -S (e.g., CIFAR10-S). For the latter, we choose a random 20% of training samples to corrupt, and we corrupt these samples by flipping their labels to a different label chosen uniformly at random over the remaining classes. We denote these datasets with -C (e.g., CIFAR10-C).
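The label-corruption protocol quoted above (flip a random 20% of training labels to a different class chosen uniformly over the remaining classes) can be sketched as follows; `corrupt_labels` is an illustrative helper, not code from the paper's repository:

```python
import numpy as np

def corrupt_labels(labels, frac=0.2, num_classes=10, seed=0):
    """Flip a random `frac` of labels to a different class chosen
    uniformly at random over the remaining classes (the '-C' datasets)."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n = len(labels)
    idx = rng.choice(n, size=int(frac * n), replace=False)
    for i in idx:
        # adding a nonzero offset mod num_classes guarantees a *different* label
        offset = rng.integers(1, num_classes)
        labels[i] = (labels[i] + offset) % num_classes
    return labels
```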
Researcher Affiliation Collaboration Behrooz Tahmasebi 1, Ashkan Soleymani 2, Dara Bahri 3, Stefanie Jegelka 4 1, Patrick Jaillet 2. 1MIT CSAIL, 2MIT LIDS, 3Google DeepMind, 4TU Munich. Correspondence to: Behrooz Tahmasebi <bzt@mit.edu>.
Pseudocode Yes Algorithm 1 (ϕ, ψ, µ)-Sharpness-Aware Minimization Algorithm (with m = 1)
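Algorithm 1 itself is not reproduced here. As a rough illustration of its standard special case (m = 1 with the usual Euclidean ascent step, i.e., vanilla SAM rather than the paper's general (ϕ, ψ, µ) form), one step can be sketched in numpy; `sam_step` and the quadratic toy `grad_fn` in the test are illustrative names, not the paper's:

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One step of standard SAM (the m = 1 special case):
    ascend to a worst-case perturbation on an L2 ball of radius rho,
    then descend using the gradient taken at the perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # normalized ascent direction
    g_adv = grad_fn(w + eps)                     # gradient at perturbed weights
    return w - lr * g_adv
```

The paper's universal class replaces this fixed perturbation/aggregation with general functions (ϕ, ψ, µ), which is what biases the iterates toward measures such as the Hessian's Frobenius norm or determinant.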
Open Source Code Yes Our code is available at https://github.com/dbahri/ universal_sam.
Open Datasets Yes We evaluate on three vision datasets: CIFAR10, CIFAR100, and SVHN. Furthermore, we study how our explicit bias may be helpful in settings that generally benefit from regularization; specifically, when training data is limited and when training labels are noisy. For the former, we artificially subsample each original dataset, keeping only the first 10% of training samples, and we denote these sub-sampled datasets with -S (e.g., CIFAR10-S). For the latter, we choose a random 20% of training samples to corrupt, and we corrupt these samples by flipping their labels to a different label chosen uniformly at random over the remaining classes. We denote these datasets with -C (e.g., CIFAR10-C). We train a simple 6-layer ReLU network with 128 hidden units on MNIST using momentum-SGD (with momentum 0.9 and learning rate 0.001) for 20 epochs.
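The MNIST probe above uses momentum-SGD with momentum 0.9 and learning rate 0.001. A minimal sketch of that update rule, with illustrative names (`momentum_sgd_step` is not from the paper's code):

```python
def momentum_sgd_step(w, v, grad, lr=0.001, momentum=0.9):
    """Classical momentum-SGD: the velocity v accumulates a decaying
    sum of past gradients, and the weights move along the velocity."""
    v = momentum * v - lr * grad
    return w + v, v
```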
Dataset Splits No The paper mentions training and testing, and modifications to training data (subsampling, corruption), but does not explicitly detail training/validation/test dataset splits or proportions.
Hardware Specification No The paper does not provide specific details about the hardware (e.g., GPU, CPU models, or cloud instance types) used to run the experiments.
Software Dependencies No The paper mentions the use of the 'PyHessian library (Yao et al., 2020)' and 'momentum-SGD' but does not specify version numbers for these or other software dependencies.
Experiment Setup Yes For CIFAR10 and CIFAR100, we apply random crops and random horizontal flips. We use a momentum term of 0.9 for all datasets and a weight decay of 5e-4 for CIFAR10 and SVHN and 1e-3 for CIFAR100. We use batch size 128 and train for 200 epochs. We use a multi-step schedule where the learning rate is initially 0.1 and decays by a multiplicative factor of 0.1 every 50 epochs. We run each experiment with four different random seeds to assess statistical significance. We use 1280 training examples and 100 noise samples to estimate the Frobenius norm and trace via Hessian-vector products. We set ρ to 1.0 for Det-SAM, 0.01 for Trace-SAM, and sweep it in {0.005, 0.01} for Frob-SAM. For Det-SAM and Trace-SAM we sweep λ in {0.01, 0.1, 1.0} and set n = 1. For Frob-SAM, we sweep λ in {0.0001, 0.001, 0.005, 0.01, 0.05, 0.1} and set n = 2. The hyper-parameters selected for each setting are given in Table 4 and Table 5.
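The setup estimates the Hessian's trace and Frobenius norm from Hessian-vector products with random noise samples. A minimal Hutchinson-style sketch under that reading (the exact estimator form is an assumption; the paper only states that noise samples and Hessian-vector products are used): with Rademacher probes v, tr(H) ≈ E[vᵀHv] and ‖H‖_F² = tr(HᵀH) ≈ E[‖Hv‖²].

```python
import numpy as np

def hutchinson_estimates(hvp, dim, num_samples=100, seed=0):
    """Estimate tr(H) and ||H||_F^2 using only Hessian-vector products.

    hvp: callable mapping a probe vector v to the product H @ v
    (e.g., computed by autodiff without materializing H)."""
    rng = np.random.default_rng(seed)
    trace, frob_sq = 0.0, 0.0
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=dim)  # Rademacher probe
        hv = hvp(v)
        trace += v @ hv        # unbiased for tr(H)
        frob_sq += hv @ hv     # unbiased for tr(H^T H) = ||H||_F^2
    return trace / num_samples, frob_sq / num_samples
```

For a diagonal Hessian both estimators are exact for every Rademacher probe, since v_i² = 1.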