Improving Sharpness-Aware Minimization by Lookahead

Authors: Runsheng Yu, Youzhi Zhang, James Kwok

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on standard benchmark datasets also verify that the proposed method outperforms the SOTAs and converges more effectively to flat minima."
Researcher Affiliation | Academia | (1) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology; (2) Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, CAS.
Pseudocode | Yes | Algorithm 1: Lookahead SAM and Optimistic Lookahead-SAM. Algorithm 2: Adaptive Optimistic SAM (AO-SAM). (A sketch of the underlying SAM step appears after this table.)
Open Source Code | No | The paper provides no statement of, or link to, open-source code for the proposed method.
Open Datasets | Yes | "We use the popular image classification datasets CIFAR-10 and CIFAR-100 (Krizhevsky et al., 2009). ... we perform experiments on the ImageNet dataset using ResNet-50 (He et al., 2016) ... we perform NLP paraphrase identification using the pre-trained BERT-Large (Devlin et al., 2018) on the Microsoft Research Paraphrase Corpus (MRPC) dataset (Dolan & Brockett, 2005)."
Dataset Splits | Yes | "10% of the training set is used for validation." (A minimal split sketch appears after this table.)
Hardware Specification | No | The paper mentions "GPU memory" but does not specify the GPU model, CPU, or any other hardware used for the experiments.
Software Dependencies | No | The paper mentions the SGD optimizer and a cosine learning rate schedule (Loshchilov & Hutter, 2017) but gives no version numbers for dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "Following the setup in (Jiang et al., 2023; Foret et al., 2021), we use batch size 128, initial learning rate 0.1, cosine learning rate schedule (Loshchilov & Hutter, 2017), and SGD optimizer. Learning rate η̃t is always set to ηt. The number of training epochs is 200. For the proposed methods, we select ρ ∈ {0.01, 0.05, 0.08, 0.1, 0.5, 0.8, 1, 1.5, 1.8, 2} using CIFAR-10's validation set on ResNet-18. The selected ρ is then directly used on CIFAR-100 and the other backbones. For the ct schedule in (6), since different SAM variants yield different %SAM values, we vary the hyper-parameters (κ1, κ2) so that the %SAM obtained by AO-SAM matches their %SAM values. Hyper-parameters for the other baselines are the same as in their original papers." (A configuration sketch appears after this table.)
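
For readers reconstructing the method from the Pseudocode row: the sketch below implements only the vanilla SAM step (Foret et al., 2021) that Algorithms 1 and 2 build on. The paper's lookahead and optimistic modifications to the perturbation step are not reproduced here, and the function name and ρ default are illustrative, not taken from the paper.

```python
import torch

def sam_step(model, loss_fn, data, target, opt, rho=0.05):
    """One vanilla SAM update (Foret et al., 2021).

    The paper's Lookahead/Optimistic variants change how the
    perturbation direction is chosen; that modification is NOT
    reproduced here.
    """
    # 1) Gradient at the current weights w.
    loss_fn(model(data), target).backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    # 2) Ascent step: move to w + eps, where eps = rho * grad / ||grad||.
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                eps.append(None)
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)
            eps.append(e)
    opt.zero_grad()
    # 3) Gradient at the perturbed point w + eps.
    loss_fn(model(data), target).backward()
    # 4) Undo the perturbation, then descend with the base optimizer.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)
    opt.step()
    opt.zero_grad()
```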
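The 10% validation split noted in the Dataset Splits row can be reproduced with a standard random split, as in the sketch below. The fixed seed and the ToTensor transform are assumptions for reproducibility; the paper does not state them.

```python
import torch
from torch.utils.data import random_split
from torchvision import datasets, transforms

# CIFAR-10 training set; transform is a placeholder assumption.
train_full = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
n_val = len(train_full) // 10          # 10% held out for validation
n_train = len(train_full) - n_val
# Fixed seed for a reproducible split (assumption; not stated in the paper).
train_set, val_set = random_split(
    train_full, [n_train, n_val],
    generator=torch.Generator().manual_seed(0))
```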
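Finally, a minimal training configuration matching the quoted Experiment Setup row: batch size 128, initial learning rate 0.1, SGD with a cosine schedule, and 200 epochs. Momentum and weight decay are not given in the excerpt, so the values below are common CIFAR defaults flagged as assumptions; `train_set` and `sam_step` come from the two sketches above, and torchvision's ImageNet-style ResNet-18 stands in for whatever CIFAR variant the authors used.

```python
import torch
import torchvision
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.utils.data import DataLoader

EPOCHS = 200          # quoted: 200 training epochs
BATCH_SIZE = 128      # quoted: batch size 128
RHO_GRID = [0.01, 0.05, 0.08, 0.1, 0.5, 0.8, 1, 1.5, 1.8, 2]  # quoted rho search grid

# torchvision's ResNet-18 is the ImageNet-style variant; the paper's exact
# CIFAR backbone may differ (assumption).
model = torchvision.models.resnet18(num_classes=10)
# Momentum 0.9 and weight decay 5e-4 are common CIFAR defaults,
# NOT stated in the quoted setup.
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine schedule

train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(EPOCHS):
    for data, target in train_loader:
        sam_step(model, loss_fn, data, target, optimizer, rho=0.05)
    scheduler.step()
```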