Forget Sharpness: Perturbed Forgetting of Model Biases Within SAM Dynamics
Authors: Ankit Vani, Frederick Tung, Gabriel L. Oliveira, Hossein Sharifi-Noghabi
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the perturbed forgetting perspective of SAM. We relate perturbed forgetting to generalization based on the information bottleneck principle, argue how standard SAM perturbations decrease an information-theoretic generalization bound, and empirically validate that forgetting can correlate with generalization better than loss surface flatness. Embracing the perturbed forgetting perspective, we design the OBF perturbation that targets model biases exposed in the model's outputs. Despite not necessarily exhibiting the lowest sharpness, our perturbation leads to improved generalization with the SAM, GSAM, and ASAM frameworks on ImageNet (Deng et al., 2009) and robustness benchmarks using ViTs (Dosovitskiy et al., 2020) and ResNets (He et al., 2016). Our results suggest that the training dynamics of SAM may be more important than minimizing loss surface sharpness. The pursuit of flat minima could be a red herring, and the benefits of SAM's training dynamics might be better explained by other mechanistic principles. A minimal sketch of one SAM-style update step with a swappable perturbation objective appears after this table. |
| Researcher Affiliation | Collaboration | 1Mila, Université de Montréal; 2Borealis AI. |
| Pseudocode | Yes | Algorithm A.1 Iterated Output Bias Forgetting |
| Open Source Code | Yes | Source code: https://github.com/BorealisAI/perturbed-forgetting. |
| Open Datasets | Yes | Datasets. We train our models on ImageNet-1K, also known as ImageNet-V1 (Deng et al., 2009), and also perform finetuning experiments with CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). |
| Dataset Splits | Yes | Data Collection. Unlike sharpness, which can be evaluated on the converged parameters of the model, perturbations are inherently dynamic and need to be captured at various points during training. To this end, we collect the softmax model outputs Ŷ on the CIFAR-10 validation set every 25th epoch during training for unperturbed and perturbed parameters for our pool of models. We evaluate on the ImageNet validation set, and the additional test sets ImageNet-Real (Beyer et al., 2020) and ImageNet-V2 (Recht et al., 2019). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or specific cloud instances) used for running its experiments. It only mentions 'All models are trained with a global batch size of 4096' and other training parameters. |
| Software Dependencies | No | The paper mentions software like 'ADAMW' and 'SGD' as optimizers and 'BCE' or 'CE' for loss functions, but it does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Training. We follow the setting of GSAM (Zhuang et al., 2022) and Chen et al. (2021), and train our models with Inception-style pre-processing (Szegedy et al., 2015) without strong data augmentations for both ViT and ResNet models. All models are trained with a global batch size of 4096, perturbation batch size m = 64, and linear learning rate decay schedule with warmup. We apply the same scheduling of the perturbation radius ρ that GSAM uses for both GSAM and SAM, which provides stronger baseline results, but keep ρ constant when using the OBF perturbation. We provide all hyperparameter values in Appendix C. Finetuning. When finetuning on CIFAR-{10,100}, we use the same pre-processing scheme as we do for training. We finetune ViT-S/32 and ResNet-50 with SGD with momentum 0.9 for 100 epochs, without weight decay, and gradients clipped to global norm 1. We use a smaller batch size of 512, but keep the perturbation batch size m = 64. All other hyperparameters are provided in Appendix C. A hedged sketch of the finetuning optimizer configuration also appears after this table. |
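
As a reading aid for the perturbed-forgetting rows above, the following is a minimal sketch of one SAM-style update step in PyTorch. It is not the authors' implementation (see the linked repository for that): `perturb_loss_fn` is a hypothetical hook, which for standard SAM is simply the training loss, and which the paper's OBF perturbation (Algorithm A.1) would replace with an output-bias-forgetting objective computed on the model's outputs. The paper computes the perturbation gradient on a smaller sub-batch (m = 64) than the update gradient; this sketch uses a single batch for brevity.

```python
import torch


def sam_style_step(model, loss_fn, perturb_loss_fn, optimizer, x, y, rho=0.05):
    """One SAM-style update: perturb the weights, compute the update gradient
    at the perturbed point, restore the weights, and step the optimizer."""
    params = [p for p in model.parameters() if p.requires_grad]

    # 1) Gradient of the perturbation objective at the current weights.
    #    Standard SAM uses the training loss here; OBF would swap in its
    #    output-bias-forgetting objective instead.
    perturb_loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() if p.grad is not None else torch.zeros_like(p)
             for p in params]
    grad_norm = torch.norm(torch.stack([g.norm() for g in grads])) + 1e-12
    optimizer.zero_grad()

    # 2) Ascend to the perturbed point w + rho * g / ||g||.
    with torch.no_grad():
        eps = [rho * g / grad_norm for g in grads]
        for p, e in zip(params, eps):
            p.add_(e)

    # 3) Training-loss gradient evaluated at the perturbed weights.
    loss_fn(model(x), y).backward()

    # 4) Undo the perturbation, then update the unperturbed weights.
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    optimizer.step()
    optimizer.zero_grad()
```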
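
For the finetuning setup quoted above, a hedged configuration sketch follows, assuming PyTorch; the helper names are hypothetical. Learning rates and the remaining hyperparameters are listed in the paper's Appendix C and are not reproduced here.

```python
import torch


def build_finetune_optimizer(model, lr):
    # Finetuning settings quoted in the table: SGD with momentum 0.9 and no
    # weight decay (batch size 512, perturbation batch size m = 64).
    return torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=0.0)


def clip_and_step(model, optimizer):
    # Gradients are clipped to global norm 1 before each optimizer step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
```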