Forget Sharpness: Perturbed Forgetting of Model Biases Within SAM Dynamics
Authors: Ankit Vani, Frederick Tung, Gabriel L. Oliveira, Hossein Sharifi-Noghabi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the perturbed forgetting perspective of SAM. We relate perturbed forgetting to generalization based on the information bottleneck principle, argue how standard SAM perturbations decrease an information-theoretic generalization bound, and empirically validate that forgetting can correlate with generalization better than loss surface flatness. Embracing the perturbed forgetting perspective, we design the OBF perturbation that targets model biases exposed in the model's outputs. Despite not necessarily exhibiting the lowest sharpness, our perturbation leads to improved generalization with the SAM, GSAM, and ASAM frameworks on ImageNet (Deng et al., 2009) and robustness benchmarks using ViTs (Dosovitskiy et al., 2020) and ResNets (He et al., 2016). Our results suggest that the training dynamics of SAM may be more important than minimizing loss surface sharpness. The pursuit of flat minima could be a red herring, and the benefits of SAM's training dynamics might be better explained by other mechanistic principles. A minimal sketch of one SAM-style update step with a swappable perturbation objective appears after this table. |
| Researcher Affiliation | Collaboration | 1Mila, Université de Montréal; 2Borealis AI. |
| Pseudocode | Yes | Algorithm A.1 Iterated Output Bias Forgetting |
| Open Source Code | Yes | Source code: https://github.com/BorealisAI/perturbed-forgetting. |
| Open Datasets | Yes | Datasets. We train our models on ImageNet-1K, also known as ImageNet-V1 (Deng et al., 2009), and also perform finetuning experiments with CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | Data Collection. Unlike sharpness, which can be evaluated on the converged parameters of the model, perturbations are inherently dynamic and need to be captured at various points during training. To this end, we collect the softmax model outputs Ŷ on the CIFAR-10 validation set every 25th epoch during training for unperturbed and perturbed parameters for our pool of models. We evaluate on the ImageNet validation set, and the additional test sets ImageNet-Real (Beyer et al., 2020) and ImageNet-V2 (Recht et al., 2019). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific cloud instances) used for running its experiments. It only mentions 'All models are trained with a global batch size of 4096' and other training parameters. |
| Software Dependencies | No | The paper mentions software like 'ADAMW' and 'SGD' as optimizers and 'BCE' or 'CE' for loss functions, but it does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Training. We follow the setting of GSAM (Zhuang et al., 2022) and Chen et al. (2021), and train our models with Inception-style pre-processing (Szegedy et al., 2015) without strong data augmentations for both ViT and ResNet models. All models are trained with a global batch size of 4096, perturbation batch size m = 64, and linear learning rate decay schedule with warmup. We apply the same scheduling of the perturbation radius ρ that GSAM uses for both GSAM and SAM, which provides stronger baseline results, but keep ρ constant when using the OBF perturbation. We provide all hyperparameter values in Appendix C. Finetuning. When finetuning on CIFAR-{10,100}, we use the same pre-processing scheme as we do for training. We finetune ViT-S/32 and ResNet-50 with SGD with momentum 0.9 for 100 epochs, without weight decay, and gradients clipped to global norm 1. We use a smaller batch size of 512, but keep the perturbation batch size m = 64. All other hyperparameters are provided in Appendix C. A hedged sketch of the finetuning optimizer configuration also appears after this table. |
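
As a reading aid for the perturbed-forgetting rows above, the following is a minimal sketch of one SAM-style update step in PyTorch. It is not the authors' implementation (see the linked repository for that): `perturb_loss_fn` is a hypothetical hook, which for standard SAM is simply the training loss, and which the paper's OBF perturbation (Algorithm A.1) would replace with an output-bias-forgetting objective computed on the model's outputs. The paper computes the perturbation gradient on a smaller sub-batch (m = 64) than the update gradient; this sketch uses a single batch for brevity.

```python
import torch


def sam_style_step(model, loss_fn, perturb_loss_fn, optimizer, x, y, rho=0.05):
    """One SAM-style update: perturb the weights, compute the update gradient
    at the perturbed point, restore the weights, and step the optimizer."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Gradient of the perturbation objective at the current weights.
    #    Standard SAM uses the training loss here; OBF would swap in its
    #    output-bias-forgetting objective instead.
    perturb_loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    optimizer.zero_grad()

    # 2) Ascend to the perturbed point w + rho * g / ||g||.
    with torch.no_grad():
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(params, eps):
            p.add_(e)

    # 3) Training-loss gradient evaluated at the perturbed weights.
    loss_fn(model(x), y).backward()

    # 4) Undo the perturbation, then update the unperturbed weights.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```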
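
For the finetuning setup quoted above, a hedged configuration sketch follows, assuming PyTorch; the helper names are hypothetical. Learning rates and the remaining hyperparameters are listed in the paper's Appendix C and are not reproduced here.

```python
import torch


def build_finetune_optimizer(model, lr):
    # Finetuning settings quoted in the table: SGD with momentum 0.9 and no
    # weight decay (batch size 512, perturbation batch size m = 64).
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=0.0)


def clip_and_step(model, optimizer):
    # Gradients are clipped to global norm 1 before each optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```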