Improving Sharpness-Aware Minimization by Lookahead
Authors: Runsheng Yu, Youzhi Zhang, James Kwok
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on standard benchmark datasets also verify that the proposed method outperforms the SOTAs and converges more effectively to flat minima. |
| Researcher Affiliation | Academia | 1Department of Computer Science and Engineering, The Hong Kong University of Science and Technology 2Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS. |
| Pseudocode | Yes | Algorithm 1: Lookahead SAM and Optimistic Lookahead-SAM. Algorithm 2: Adaptive Optimistic SAM (AO-SAM). |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-source code of its proposed methodology. |
| Open Datasets | Yes | we use the popular image classification datasets CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). ... we perform experiments on the ImageNet dataset using ResNet-50 (He et al., 2016)... we perform NLP paraphrase identification using the pre-trained BERT-Large (Devlin et al., 2018) on the Microsoft Research Paraphrase Corpus (MRPC) dataset (Dolan & Brockett, 2005). |
| Dataset Splits | Yes | 10% of the training set is used for validation. |
| Hardware Specification | No | The paper mentions 'GPU memory' but does not specify any particular GPU model, CPU, or other hardware specifications used for experiments. |
| Software Dependencies | No | The paper mentions 'SGD optimizer' and 'cosine learning rate schedule (Loshchilov & Hutter, 2017)' but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | Following the setup in (Jiang et al., 2023; Foret et al., 2021), we use batch size 128, initial learning rate 0.1, cosine learning rate schedule (Loshchilov & Hutter, 2017), and SGD optimizer. Learning rate η̃t is always set to ηt. The number of training epochs is 200. For the proposed methods, we select ρ ∈ {0.01, 0.05, 0.08, 0.1, 0.5, 0.8, 1, 1.5, 1.8, 2} using CIFAR-10's validation set on ResNet-18. The selected ρ is then directly used on CIFAR-100 and the other backbones. For the ct schedule in (6), since different SAM variants yield different %SAMs, we vary the hyper-parameters (κ1, κ2) so that the %SAM obtained by AO-SAM matches their %SAM values. Hyper-parameters for the other baselines are the same as in their original papers. (A hedged configuration sketch based on these settings follows the table.) |
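
To make the Experiment Setup and Dataset Splits rows concrete, below is a minimal sketch of the reported training configuration: batch size 128, initial learning rate 0.1, SGD with a cosine schedule, 200 epochs, and 10% of the CIFAR-10 training set held out for validation. PyTorch/torchvision, the momentum and weight-decay values, the off-the-shelf ResNet-18, and the bare `ToTensor` preprocessing are assumptions not stated in the paper.

```python
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
from torchvision.models import resnet18

# Values reported in the paper: batch size 128, initial LR 0.1,
# SGD optimizer, cosine learning-rate schedule, 200 epochs.
EPOCHS, BATCH_SIZE, LR = 200, 128, 0.1

# Data augmentation is not specified in the paper; only tensor conversion here.
train_full = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
val_size = int(0.1 * len(train_full))          # 10% of the training set for validation
train_set, val_set = random_split(
    train_full, [len(train_full) - val_size, val_size])

train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(val_set, batch_size=BATCH_SIZE)

# Momentum and weight decay are NOT stated in the paper; the values
# below are common CIFAR defaults and purely illustrative.
model = resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=LR,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)
```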
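
The Pseudocode row names Algorithm 1 (Lookahead SAM and Optimistic Lookahead-SAM) and Algorithm 2 (AO-SAM), which build on the standard SAM update. For orientation only, here is a sketch of that baseline SAM step (Foret et al., 2021), not the authors' lookahead variants; the function name and the ρ default are illustrative.

```python
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    """One update of standard SAM (Foret et al., 2021): ascend to a
    perturbed point within an L2 ball of radius rho, compute the
    gradient there, and apply it at the original weights."""
    # 1) Gradient at the current weights w.
    loss_fn(model(x), y).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm(p=2) for p in model.parameters() if p.grad is not None]))
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                perturbations.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                     # move to w + e (ascent step)
            perturbations.append(e)
    model.zero_grad()

    # 2) Gradient at the perturbed point, applied back at w.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), perturbations):
            if e is not None:
                p.sub_(e)                 # restore the original weights w
    base_opt.step()                       # base optimizer (e.g., SGD) step with the SAM gradient
    base_opt.zero_grad()
```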