Sharpness-Aware Minimization Leads to Low-Rank Features

Authors: Maksym Andriushchenko, Dara Bahri, Hossein Mobahi, Nicolas Flammarion

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Section 3, we present extensive empirical evidence of low-rank features for various models (ResNets, ViTs, MLP-Mixers) trained with SAM on four classification tasks (CIFAR-10/100, Tiny ImageNet, ImageNet-1k) as well as for contrastive text-image training (MS-COCO).
Researcher Affiliation | Collaboration | Maksym Andriushchenko (EPFL, maksym.andriushchenko@epfl.ch); Dara Bahri (Google Research, dbahri@google.com); Hossein Mobahi (Google Research, hmobahi@google.com); Nicolas Flammarion (EPFL, nicolas.flammarion@epfl.ch)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | We make our code available at https://github.com/tml-epfl/sam-low-rank-features.
Open Datasets | Yes | We train a PreActResNet-18 (He et al., 2016b) with standard augmentations on standard deep learning datasets: CIFAR-10, CIFAR-100 (Krizhevsky and Hinton, 2009), Tiny ImageNet (Le and Yang, 2015), and ImageNet-1k (Deng et al., 2009), as well as contrastive learning on MS-COCO (Lin et al., 2014).
Dataset Splits | No | The paper mentions evaluating on training examples and implies the use of test sets, but does not provide explicit train/validation/test splits (specific percentages, sample counts, or explicit mention of a validation set) for all experiments.
Hardware Specification | Yes | We performed all experiments on a single Nvidia A100 GPU; we used an internal cluster for all experiments except those on MS-COCO, for which we used a cloud provider.
Software Dependencies | No | The paper mentions using Adam (Kingma and Ba, 2014) but does not provide version numbers for key software components such as the deep learning framework (e.g., PyTorch or TensorFlow), Python, or CUDA.
Experiment Setup | Yes | We train these models with batch size 256 for 200 epochs using standard augmentations (random crops and random mirroring). For the minimal setting, we use plain SGD with learning rate 0.05. For the state-of-the-art setting, we use SGD with learning rate 0.1 (decayed by a factor of 10 after 50% and 90% of epochs), momentum parameter 0.9, and weight decay 0.0005... We use Adam (Kingma and Ba, 2014) with learning rate 0.0001, decayed to 0 using a cosine schedule. We train these models with batch size 128 for 25 epochs without data augmentations.
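
The Experiment Setup row above pins down most of the image-classification hyperparameters, so a short sketch can make that configuration concrete. The PyTorch snippet below is a minimal illustration, not the authors' released code: the model and data-loader constructors and the SAM radius RHO are assumptions, while the optimizer, step-decay points (50% and 90% of epochs), momentum, weight decay, batch size, and epoch count follow the quoted setup; the two-pass update is the standard SAM step of Foret et al. (2021).

```python
# Minimal sketch of the "state-of-the-art" ResNet setup quoted above.
# make_preact_resnet18(), make_cifar10_loader(), and RHO are assumptions,
# not values taken from the quoted text.
import torch
import torch.nn.functional as F

EPOCHS, BATCH_SIZE, RHO = 200, 256, 0.1  # RHO is a placeholder perturbation radius

model = make_preact_resnet18(num_classes=10)    # hypothetical model constructor
train_loader = make_cifar10_loader(BATCH_SIZE)  # hypothetical loader with crops + mirroring

opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[int(0.5 * EPOCHS), int(0.9 * EPOCHS)], gamma=0.1)

for epoch in range(EPOCHS):
    for x, y in train_loader:
        # First pass: gradient at the current weights.
        F.cross_entropy(model(x), y).backward()
        with torch.no_grad():
            grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                                       for p in model.parameters() if p.grad is not None))
            eps = [RHO * p.grad / (grad_norm + 1e-12) if p.grad is not None else None
                   for p in model.parameters()]
            for p, e in zip(model.parameters(), eps):
                if e is not None:
                    p.add_(e)                    # ascend to the perturbed point
        opt.zero_grad()
        # Second pass: gradient at the perturbed weights (the SAM gradient).
        F.cross_entropy(model(x), y).backward()
        with torch.no_grad():
            for p, e in zip(model.parameters(), eps):
                if e is not None:
                    p.sub_(e)                    # restore the original weights
        opt.step()                               # SGD step with the SAM gradient
        opt.zero_grad()
    sched.step()
```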
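The Research Type row reports low-rank features for SAM-trained models, which presupposes a way to measure the rank of learned representations. One common criterion is to count how many principal directions are needed to explain a fixed fraction (e.g., 99%) of the variance of the penultimate-layer features; whether this matches the paper's exact definition should be checked against the released code. The sketch below implements that criterion; the 99% threshold and the get_penultimate_features() helper are assumptions.

```python
# Illustrative feature-rank measurement: number of principal directions
# needed to explain a given fraction of the variance of a feature matrix.
# The 99% default and get_penultimate_features() are assumptions, not
# taken verbatim from the paper.
import torch

@torch.no_grad()
def feature_rank(features: torch.Tensor, explained: float = 0.99) -> int:
    """features: (n_examples, feature_dim) matrix of penultimate-layer activations."""
    centered = features - features.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(centered)        # PCA spectrum of the centered matrix
    var = s ** 2
    ratio = torch.cumsum(var, dim=0) / var.sum()
    return int((ratio < explained).sum().item()) + 1

# Usage sketch: compare a SAM-trained and an SGD-trained model.
# feats_sam = get_penultimate_features(model_sam, loader)  # hypothetical helper
# feats_sgd = get_penultimate_features(model_sgd, loader)
# print(feature_rank(feats_sam), feature_rank(feats_sgd))
```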