Improving SAM Requires Rethinking its Optimization Formulation
Authors: Wanyun Xie, Fabian Latorre, Kimon Antonakopoulos, Thomas Pethick, Volkan Cevher
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we verify the benefit of BiSAM across a variety of models, datasets and tasks. 4.1. Image classification. We follow the experimental setup of Kwon et al. (2021). We use the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009), both consisting of 50,000 training images of size 32×32, with 10 and 100 classes, respectively. [...] The results can be found in Tables 1 and 2. |
| Researcher Affiliation | Academia | Laboratory for Information and Inference Systems, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland. Correspondence to: Wanyun Xie <wanyun.xie@epfl.ch>. |
| Pseudocode | Yes | Algorithm 1 Bilevel SAM (BiSAM). Input: initialization w₀ ∈ ℝᵈ, iterations T, batch size b, step sizes {η_t}, t = 0, …, T−1, neighborhood size ρ > 0, µ > 0, lower bound ϕ. For t = 0 to T−1: (1) sample minibatch B = {(x₁, y₁), …, (x_b, y_b)}; (2) compute the (stochastic) gradient of the perturbation loss Q_{ϕ,µ}(w_t) defined in Equation (11); (3) compute perturbation ϵ_t = ρ ∇_w Q(w_t); (4) compute gradient g_t = ∇_w L_B(w_t + ϵ_t); (5) update weights w_{t+1} = w_t − η_t g_t. |
| Open Source Code | Yes | Our code is available at https://github.com/LIONS-EPFL/BiSAM. |
| Open Datasets | Yes | We use the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009), both consisting of 50,000 training images of size 32×32, with 10 and 100 classes, respectively. ... We compare the performance of SAM and BiSAM on ImageNet-1K (Russakovsky et al., 2015). ... finetune the model on Oxford-Flowers (Nilsback and Zisserman, 2008) and Oxford-IIIT Pets (Parkhi et al., 2012) datasets. ... finetune it on the GLUE datasets (Wang et al., 2018). |
| Dataset Splits | Yes | The training data is randomly partitioned into a training set and validation set consisting of 90% and 10%, respectively. We deviate from Foret et al. (2021); Kwon et al. (2021) by using the validation set to select the model on which we report the test accuracy in order to avoid overfitting on the test set. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions PyTorch and TensorFlow as software used, but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The models are trained using stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 5×10⁻⁴. We used a batch size of 128, and a cosine learning rate schedule that starts at 0.1. The number of epochs is set to 200 for SAM and BiSAM, while SGD is given 400 epochs. ... Label smoothing with a factor 0.1 is employed for all methods. For the SAM and BiSAM hyperparameter ρ we use a value of 0.05. We fix µ = 10 and α = 0.1 for BiSAM (tanh) and µ = 1 for BiSAM (-log) throughout all experiments on both CIFAR-10 and CIFAR-100 datasets... |
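The update rule quoted in the Pseudocode row can be sketched compactly. The sketch below is a minimal NumPy illustration on a toy linear softmax classifier, not the authors' implementation: the paper's perturbation loss Q_{ϕ,µ} from its Equation (11) is not reproduced here, so plain cross-entropy stands in for Q, and the function names (`bisam_step`, `ce_grad`) are hypothetical. It shows the two-gradient structure of one iteration: a perturbation built from the gradient of Q at w_t, then an SGD step using the training-loss gradient at w_t + ϵ_t.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def ce_grad(W, X, y):
    """Gradient of mean cross-entropy of the linear model X @ W w.r.t. W."""
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

def bisam_step(W, X, y, rho=0.05, lr=0.1, q_grad=ce_grad):
    """One BiSAM-style iteration on minibatch (X, y).

    q_grad is a stand-in for the gradient of the perturbation loss
    Q_{phi,mu}; the paper defines Q differently (its Eq. (11)).
    """
    # Steps (2)-(3): perturbation eps_t = rho * grad_w Q(w_t)
    eps = rho * q_grad(W, X, y)
    # Step (4): training-loss gradient at the perturbed point w_t + eps_t
    g = ce_grad(W + eps, X, y)
    # Step (5): SGD update applied at the original point w_t
    return W - lr * g
```

Note the asymmetry that defines the SAM family: the gradient is evaluated at the perturbed weights, but the update is applied to the unperturbed ones. BiSAM's change relative to SAM is confined to which loss generates the perturbation (Q_{ϕ,µ} instead of the training loss itself).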