Sharpness-Aware Minimization Enhances Feature Quality via Balanced Learning
Authors: Jacob Mitchell Springer, Vaishnavh Nagarajan, Aditi Raghunathan
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our insights are supported by experiments on real data: we demonstrate that SAM improves the quality of features in datasets containing redundant or spurious features, including CelebA, Waterbirds, CIFAR-MNIST, and DomainBed. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Google Research; {jspringer,raditi}@cmu.edu¹, vaishnavh@google.com² |
| Pseudocode | Yes | The architecture is defined by the following pseudo-PyTorch (a runnable reconstruction appears below the table): `torch.nn.Sequential(torch.nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), torch.nn.ReLU(inplace=True), torch.nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), torch.nn.ReLU(inplace=True), torch.nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), torch.nn.ReLU(inplace=True), torch.nn.Flatten(), torch.nn.Linear(n_features, num_classes))` |
| Open Source Code | No | The paper does not provide any explicit statement about making its source code available or a link to a code repository. |
| Open Datasets | Yes | Datasets. We use four datasets in our experiments each annotated by two features: CelebA (Liu et al., 2015), Waterbirds (Sagawa et al., 2019), CIFAR-MNIST (binary) (Shah et al., 2020), and FMNIST-MNIST (5-class) (Kirichenko et al., 2022). |
| Dataset Splits | Yes | For all datasets, we use the standard train/validation/test split, and when a validation set is not provided, we use a random 90/10 split of the training set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions “pseudo-PyTorch” for describing architectures and data augmentations but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | Parameters and sweeps. For the toy experiments, we choose a constant learning rate of 0.01, a batch size of 5, 300 training points, no momentum, and no weight decay. For the CIFAR-MNIST and FMNIST-MNIST experiments, we sweep over the learning rates {0.01, 0.05, 0.1} and the SAM hyperparameter ρ over {0.0, 0.01, 0.03, 0.05, 0.07, 0.1, 0.2}. We use a batch size of 100, a cosine learning rate schedule, a momentum parameter of 0.9, and no weight decay. We normalize the images by the mean pixel value. Otherwise, we do not use data augmentation. For the CelebA and Waterbirds experiments, we sweep over the learning rates {0.0005, 0.001, 0.005, 0.01} and the ρ parameter {0.0, 0.01, 0.02, 0.05, 0.07}. We use a batch size of 128, a cosine learning rate schedule, a momentum parameter of 0.9, and a weight decay of 10⁻⁴. (A hedged training sketch using one point from these sweeps appears below the table.) |
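
For convenience, the pseudo-PyTorch architecture quoted in the Pseudocode row is written out below as a runnable function. The `in_size` argument and the way `n_features` is derived from it are our assumptions; the paper gives the layer stack but not the input resolution for each dataset.

```python
import torch

def make_cnn(num_classes: int, in_size: int = 32) -> torch.nn.Sequential:
    """Small CNN matching the pseudo-PyTorch layer stack quoted above.

    `in_size` (square input resolution) is an assumption; the paper does not
    state how n_features is computed, so we infer it from the three stride-2
    convolutions, which shrink the spatial size by a factor of 8.
    """
    spatial = in_size // 8                # e.g. 32 -> 4, 64 -> 8
    n_features = 128 * spatial * spatial  # channels * height * width after Flatten
    return torch.nn.Sequential(
        torch.nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
        torch.nn.ReLU(inplace=True),
        torch.nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        torch.nn.ReLU(inplace=True),
        torch.nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        torch.nn.ReLU(inplace=True),
        torch.nn.Flatten(),
        torch.nn.Linear(n_features, num_classes),
    )
```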
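Since no code is released (see the Open Source Code row), the sketch below shows one way the quoted experiment setup could be wired up: a two-step SAM update in the style of Foret et al. (2021) around plain SGD with momentum 0.9 and a cosine schedule. The specific values (learning rate 0.1, ρ = 0.05, batch size 100) are one point from the CIFAR-MNIST sweep in the table; this is our reconstruction, not the authors' implementation.

```python
import torch

def sam_step(model, loss_fn, x, y, optimizer, rho=0.05):
    """One SAM update: ascend to the worst-case nearby weights, then apply
    the base optimizer using gradients taken at that perturbed point."""
    optimizer.zero_grad()

    # First pass: gradients at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()

    # Perturb each parameter along the normalized gradient, scaled by rho.
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
        perturbations = []
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            perturbations.append(e)

    # Second pass: gradients at the perturbed weights.
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()

    # Restore the original weights, then step with the perturbed gradients.
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)
    optimizer.step()
    return loss.item()

# One point from the CIFAR-MNIST sweep quoted above; make_cnn is the sketch
# above, and the epoch count and dummy batch are illustrative assumptions.
model = make_cnn(num_classes=2, in_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
loss_fn = torch.nn.CrossEntropyLoss()

x, y = torch.randn(100, 3, 32, 32), torch.randint(0, 2, (100,))  # dummy batch of size 100
sam_step(model, loss_fn, x, y, optimizer, rho=0.05)
scheduler.step()  # cosine schedule stepped once per epoch
```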