Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning

Authors: Jacob Mitchell Springer, Vaishnavh Nagarajan, Aditi Raghunathan

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed.
Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Google Research; {jspringer,raditi}@cmu.edu¹, vaishnavh@google.com²
Pseudocode | Yes | The architecture is defined by the following pseudo-PyTorch:

    torch.nn.Sequential(
        torch.nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
        torch.nn.ReLU(inplace=True),
        torch.nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        torch.nn.ReLU(inplace=True),
        torch.nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        torch.nn.ReLU(inplace=True),
        torch.nn.Flatten(),
        torch.nn.Linear(n_features, num_classes)
    )
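For concreteness, the pseudocode above reconstructs as the runnable module below. This is a minimal sketch under our own assumptions, not values stated in the quoted pseudocode: 32x32 RGB inputs (CIFAR-sized, matching the CIFAR-MNIST experiments) and a binary output head. With three stride-2 convolutions the resolution goes 32 -> 16 -> 8 -> 4, so n_features = 128 * 4 * 4 = 2048.

    import torch

    # Assumption: 32x32 RGB inputs. Each stride-2 convolution halves the
    # spatial resolution (32 -> 16 -> 8 -> 4), so the flattened feature
    # vector has 128 * 4 * 4 = 2048 entries.
    n_features = 128 * 4 * 4
    num_classes = 2  # assumed binary task, e.g. CIFAR-MNIST

    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
        torch.nn.ReLU(inplace=True),
        torch.nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        torch.nn.ReLU(inplace=True),
        torch.nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        torch.nn.ReLU(inplace=True),
        torch.nn.Flatten(),
        torch.nn.Linear(n_features, num_classes),
    )

    # Sanity check: a batch of 32x32 images maps to one logit per class.
    x = torch.randn(8, 3, 32, 32)
    assert model(x).shape == (8, num_classes)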
Open Source Code | No | The paper does not provide any explicit statement about making its source code available or a link to a code repository.
Open Datasets | Yes | Datasets. We use four datasets in our experiments, each annotated by two features: CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2019), CIFAR-MNIST (binary) (Shah et al., 2020), and FMNIST-MNIST (5-class) (Kirichenko et al., 2022).
Dataset Splits | Yes | For all datasets, we use the standard train/validation/test split, and when a validation set is not provided, we use a random 90/10 split of the training set.
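A minimal sketch of that 90/10 fallback split, assuming map-style PyTorch datasets; the stand-in tensors and the seed are illustrative placeholders, not values from the paper:

    import torch
    from torch.utils.data import TensorDataset, random_split

    # Stand-in for a training set that ships without an official validation
    # split (in practice e.g. the CIFAR-MNIST training set).
    train_set = TensorDataset(torch.randn(300, 3, 32, 32),
                              torch.randint(0, 2, (300,)))

    # Random 90/10 train/validation split.
    n_val = len(train_set) // 10
    train_subset, val_subset = random_split(
        train_set,
        [len(train_set) - n_val, n_val],
        generator=torch.Generator().manual_seed(0),  # assumed seed
    )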
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper uses "pseudo-PyTorch" to describe architectures and data augmentations but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Parameters and sweeps. For the toy experiments, we choose a constant learning rate of 0.01, a batch size of 5, 300 training points, no momentum, and no weight decay. For the CIFAR-MNIST and FMNIST-MNIST experiments, we sweep over the learning rates {0.01, 0.05, 0.1} and the SAM hyperparameter ρ over {0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.2}. We use a batch size of 100, a cosine learning rate schedule, a momentum parameter of 0.9, and no weight decay. We normalize the images by the mean pixel value. Otherwise, we do not use data augmentation. For the CelebA and Waterbirds experiments, we sweep over the learning rates {0.0005, 0.001, 0.005, 0.01} and the ρ parameter {0.0, 0.01, 0.02, 0.05, 0.07}. We use a batch size of 128, a cosine learning rate schedule, a momentum parameter of 0.9, and a weight decay of 10⁻⁴.
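To make the swept quantities concrete, below is a minimal sketch of one SAM training step with the reported optimizer settings (SGD with momentum 0.9, cosine learning-rate schedule, batch size 100, and one learning rate and one ρ value taken from the sweeps). SAM is not part of core PyTorch, so the two-step ascend-then-descend logic is our own illustrative implementation, and the linear model and random batch are placeholders:

    import torch

    model = torch.nn.Linear(2048, 2)  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    criterion = torch.nn.CrossEntropyLoss()
    rho = 0.05  # one value from the reported rho sweep; rho = 0 recovers plain SGD

    def sam_step(x, y):
        # 1) Gradient at the current weights.
        criterion(model(x), y).backward()
        # 2) Ascend to the approximate worst-case weights in an L2 ball of radius rho.
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        perturbations = []
        with torch.no_grad():
            for p in model.parameters():
                if p.grad is None:
                    perturbations.append(None)
                    continue
                e = rho * p.grad / (grad_norm + 1e-12)
                p.add_(e)
                perturbations.append(e)
        optimizer.zero_grad()
        # 3) Gradient at the perturbed weights (the SAM gradient).
        criterion(model(x), y).backward()
        # 4) Restore the original weights, then take the base SGD step.
        with torch.no_grad():
            for p, e in zip(model.parameters(), perturbations):
                if e is not None:
                    p.sub_(e)
        optimizer.step()
        optimizer.zero_grad()

    x, y = torch.randn(100, 2048), torch.randint(0, 2, (100,))  # batch size 100
    sam_step(x, y)
    scheduler.step()  # cosine decay, typically stepped once per epoch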