An SDE for Modeling SAM: Theory and Insights

Authors: Enea Monzio Compagnoni, Luca Biggio, Antonio Orvieto, Frank Norbert Proske, Hans Kersting, Aurelien Lucchi

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study the SAM (Sharpness-Aware Minimization) optimizer which has recently attracted a lot of interest due to its increased performance over more classical variants of stochastic gradient descent. Our main contribution is the derivation of continuous-time models (in the form of SDEs) for SAM and two of its variants, both for the full-batch and mini-batch settings. We demonstrate that these SDEs are rigorous approximations of the real discrete-time algorithms (in a weak sense, scaling linearly with the learning rate). Using these models, we then offer an explanation of why SAM prefers flat minima over sharp ones by showing that it minimizes an implicitly regularized loss with a Hessian-dependent noise structure. Finally, we prove that SAM is attracted to saddle points under some realistic conditions. Our theoretical results are supported by detailed experiments.
Researcher Affiliation | Academia | 1 Department of Mathematics & Computer Science, University of Basel, Basel, Switzerland; 2 Department of Computer Science, ETH Zürich, Zürich, Switzerland; 3 Department of Mathematics, University of Oslo, Oslo, Norway; 4 Inria, École Normale Supérieure, PSL Research University, Paris, France.
Pseudocode | No | The paper describes its algorithms through mathematical equations for the discrete updates but does not include any pseudocode or algorithm blocks (an illustrative sketch of the standard SAM step is given after the table).
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is open-source or publicly available.
Open Datasets | Yes | The second task is a classification one on the Iris Database (Dua & Graff, 2017) using a linear MLP with 1 hidden layer. The third is a classification task on the Breast Cancer Database (Dua & Graff, 2017) using a nonlinear MLP with 1 hidden layer.
Dataset Splits | No | The paper mentions using specific datasets but does not provide details on how these datasets were split into training, validation, and test sets. It does not specify percentages or sample counts for these splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types) used to run the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments.
Experiment Setup | Yes | We use η = 0.01, ρ ∈ {0.001, 0.01, 0.1, 0.5}. The results are averaged over 3 experiments. ... We use η = 0.001, ρ ∈ {0.0001, 0.001, 0.03, 0.05}. ... The starting point is x0 = (0.02, ..., 0.02) and the number of iterations is 20000. ... The regularization parameter is fixed at λ = 0.001. We use η = 0.005, ρ = η, run for 200000 iterations and average over 3 runs.
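
Since the paper states its methods as discrete update equations rather than pseudocode, a minimal NumPy sketch of the standard full-batch SAM step it models, x_{k+1} = x_k - η ∇f(x_k + ρ ∇f(x_k)/||∇f(x_k)||), is given below. The function name, defaults, and comments are illustrative and not taken from the paper; the variants the paper analyzes (e.g., an unnormalized perturbation) would change only the perturbation line.

    import numpy as np

    def sam_step(x, grad_fn, eta=0.01, rho=0.05, eps=1e-12):
        """One standard (full-batch) SAM step:
        x_{k+1} = x_k - eta * grad f(x_k + rho * grad f(x_k) / ||grad f(x_k)||).
        grad_fn(x) should return the gradient of the loss at x."""
        g = grad_fn(x)
        # Ascent to the normalized worst-case perturbation of the current point.
        x_adv = x + rho * g / (np.linalg.norm(g) + eps)
        # Descent step taken from the original point, using the perturbed gradient.
        return x - eta * grad_fn(x_adv)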
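As a rough way to exercise the quoted setup, the loop below reuses sam_step from the sketch above with η = 0.01, ρ ∈ {0.001, 0.01, 0.1, 0.5}, and averaging over 3 runs. The quadratic objective, starting scale, and iteration budget are placeholders, since the quoted hyperparameters come from several different experiments in the paper.

    # Reuses `sam_step` from the sketch above; hypothetical toy setup, not the paper's.
    H = np.diag([1.0, 10.0])              # toy ill-conditioned quadratic 0.5 * x^T H x
    grad = lambda x: H @ x

    rng = np.random.default_rng(0)
    for rho in (0.001, 0.01, 0.1, 0.5):   # quoted rho grid
        finals = []
        for _ in range(3):                # "averaged over 3 experiments"
            x = rng.normal(scale=0.02, size=2)
            for _ in range(1000):         # placeholder iteration budget
                x = sam_step(x, grad, eta=0.01, rho=rho)
            finals.append(0.5 * x @ H @ x)
        print(f"rho={rho}: mean final loss {np.mean(finals):.3e}")

Because the perturbation is normalized, larger ρ keeps the iterates hovering at a fixed distance from the minimizer rather than converging exactly, which is the expected behavior of SAM with a constant ρ.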