Lookbehind-SAM: k steps back, 1 step forward
Authors: Gonçalo Mordido, Pranshu Malviya, Aristide Baratin, Sarath Chandar
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Experimental Results: In this section, we start by introducing our baselines (Section 4.1), and then we conduct several experiments to showcase the benefits of achieving a better sharpness-loss tradeoff in SAM methods. Particularly, we test the generalization performance on several models and datasets (Section 4.2) and analyze the loss landscapes at the end of training in terms of sharpness (Section 4.3). Then, we study the robustness provided by the different methods in noisy weight settings (Section 4.4). Lastly, we assess continual learning in sequential training settings (Section 4.5). |
| Researcher Affiliation | Collaboration | 1 Mila Quebec AI Institute, 2 Polytechnique Montreal, 3 Samsung SAIT AI Lab Montreal, 4 Canada CIFAR AI Chair. |
| Pseudocode | Yes | The pseudo-code for Lookbehind is in Algorithm 1. ... Algorithm 1 Lookbehind-SAM (a hedged sketch of this update loop appears after the table) |
| Open Source Code | Yes | Our code is available at https://github.com/chandar-lab/Lookbehind-SAM. |
| Open Datasets | Yes | We use residual networks (ResNets) (He et al., 2016) and wide residual networks (WRN) (Zagoruyko & Komodakis, 2016) models trained from scratch on CIFAR-10, CIFAR-100 (Krizhevsky et al., 2009), and ImageNet (Deng et al., 2009). |
| Dataset Splits | Yes | Table 1: Generalization performance (validation acc. %) of the different methods on several models and datasets. |
| Hardware Specification | Yes | We trained the CIFAR-10/100 models using one RTX8000 NVIDIA GPU and 1 CPU core, and the ImageNet models using one A100 GPU (with 40 and 80 GB of memory for training from scratch and fine-tuning, respectively) and 6 CPU cores. |
| Software Dependencies | No | The paper mentions optimizers such as Adam and SGD and refers to frameworks such as PyTorch (implicitly, through common usage) and fairseq, but it provides no version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For CIFAR-10/100, we trained each model for 200 epochs with a batch size of 128, starting with a learning rate of 0.1 and dividing it by 10 every 50 epochs. All models were trained using SGD with momentum set to 0.9 and weight decay of 1e-4. (see the configuration sketch after the table) |
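
The pseudocode row above points to Algorithm 1 (Lookbehind-SAM). Below is a minimal PyTorch-style sketch of what such an update could look like, inferred only from the paper's title ("k steps back, 1 step forward") and the fact that Algorithm 1 combines multi-step SAM ascent with a Lookahead-style slow-weight update. The function name `lookbehind_sam_step` and the hyperparameters `k`, `rho`, `lr`, and `alpha` are illustrative assumptions, not the authors' API; the official implementation is in the linked repository.

```python
import torch


def lookbehind_sam_step(model, loss_fn, batch, k=5, rho=0.05, lr=0.1, alpha=0.5):
    """Illustrative Lookbehind-style update: k ascent ("back") steps, one interpolated ("forward") step.

    Assumes every parameter of `model` receives a gradient from `loss_fn`.
    """
    inputs, targets = batch
    slow = [p.detach().clone() for p in model.parameters()]  # slow weights, fixed during the inner loop
    fast = [p.detach().clone() for p in model.parameters()]  # fast weights, updated k times
    eps = [torch.zeros_like(p) for p in model.parameters()]  # accumulated SAM perturbation

    for _ in range(k):
        # Evaluate the gradient at the perturbed slow weights (the current ascent point).
        for p, s, e in zip(model.parameters(), slow, eps):
            p.data.copy_(s + e)
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]

        # Ascent ("step back"): grow the perturbation along the normalized gradient.
        grad_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        for e, g in zip(eps, grads):
            e.add_(rho * g / (grad_norm + 1e-12))

        # Descent: apply the same gradient to the fast weights.
        for f, g in zip(fast, grads):
            f.sub_(lr * g)

    # "1 step forward": interpolate the slow weights toward the final fast weights.
    for p, s, f in zip(model.parameters(), slow, fast):
        p.data.copy_(s + alpha * (f - s))
```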
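
For the experiment-setup row, here is a hedged sketch of the quoted CIFAR-10/100 training configuration (200 epochs, batch size 128, SGD with momentum 0.9 and weight decay 1e-4, learning rate 0.1 divided by 10 every 50 epochs). The paper does not name framework versions, so the use of PyTorch's `SGD` and `StepLR` classes is an assumption about how this schedule would typically be implemented.

```python
import torch


def make_optimizer_and_scheduler(model):
    # SGD with the reported momentum and weight decay, starting at lr 0.1.
    optimizer = torch.optim.SGD(
        model.parameters(),
        lr=0.1,
        momentum=0.9,
        weight_decay=1e-4,
    )
    # Divide the learning rate by 10 every 50 epochs.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
    return optimizer, scheduler


# Assumed training-loop shape (train_loader would use batch size 128):
# for epoch in range(200):
#     for batch in train_loader:
#         ...one update step (e.g., the Lookbehind-style sketch above)...
#     scheduler.step()
```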