Sharpness-Aware Minimization Leads to Low-Rank Features
Authors: Maksym Andriushchenko, Dara Bahri, Hossein Mobahi, Nicolas Flammarion
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental (see the feature-rank sketch below the table) | In Section 3, we present extensive empirical evidence of low-rank features for various models (ResNets, ViTs, MLP-Mixers) trained with SAM on four classification tasks (CIFAR-10/100, TinyImageNet, ImageNet-1k) as well as for contrastive text-image training (MS-COCO). |
| Researcher Affiliation | Collaboration | Maksym Andriushchenko, EPFL, maksym.andriushchenko@epfl.ch; Dara Bahri, Google Research, dbahri@google.com; Hossein Mobahi, Google Research, hmobahi@google.com; Nicolas Flammarion, EPFL, nicolas.flammarion@epfl.ch |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We make our code available at https://github.com/tml-epfl/sam-low-rank-features. |
| Open Datasets | Yes | We train a PreActResNet-18 (He et al., 2016b) with standard augmentations on standard deep learning datasets: CIFAR-10, CIFAR-100 (Krizhevsky and Hinton, 2009), TinyImageNet (Le and Yang, 2015), and ImageNet-1k (Deng et al., 2009), and additionally perform contrastive learning on MS-COCO (Lin et al., 2014). |
| Dataset Splits | No | The paper evaluates on training examples and implicitly uses the standard test sets, but it does not state explicit training/validation/test splits (e.g., percentages, sample counts, or validation-set usage) for all experiments. |
| Hardware Specification | Yes | We performed all experiments on a single Nvidia A100 GPU. We used an internal cluster for all experiments except those on MS-COCO, for which we used a cloud provider. |
| Software Dependencies | No | The paper cites the Adam optimizer (Kingma and Ba, 2014) but does not specify version numbers for key software components such as the deep learning framework (e.g., PyTorch or TensorFlow), Python, or CUDA. |
| Experiment Setup | Yes (see the training-setup sketch below the table) | We train these models with batch size 256 for 200 epochs using standard augmentations (random crops and random mirroring). For the minimal setting, we use plain SGD with the learning rate 0.05. For the state-of-the-art setting, we use SGD with the learning rate 0.1 (decayed by a factor of 10 after 50% and 90% of the epochs), momentum parameter 0.9, weight decay 0.0005... We use Adam (Kingma and Ba, 2014) with learning rate 0.0001 which is decayed down to 0 using a cosine decay schedule. We train these models with batch size 128 for 25 epochs without data augmentations. |
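The hyperparameters quoted in the Experiment Setup row correspond to a fairly standard image-classification training loop. The sketch below is a minimal PyTorch rendition of the quoted "state-of-the-art" CIFAR-10 setting only (SGD, learning rate 0.1 decayed by a factor of 10 after 50% and 90% of the epochs, momentum 0.9, weight decay 0.0005, batch size 256, 200 epochs, random crops and mirroring). It is not the authors' released code, it uses a torchvision ResNet-18 as a stand-in for the PreActResNet-18, and it does not include the SAM wrapper itself.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

# Standard augmentations quoted in the paper: random crops and random mirroring.
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True, num_workers=4)

# Stand-in backbone: the paper trains a PreActResNet-18; substitute that definition here.
model = models.resnet18(num_classes=10).to(device)

# Quoted "state-of-the-art" setting: SGD, lr 0.1 decayed by 10x after 50% and 90% of the
# epochs, momentum 0.9, weight decay 0.0005. The quoted "minimal" setting would instead
# use plain SGD with lr 0.05.
epochs = 200
optimizer = SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = MultiStepLR(optimizer, milestones=[epochs // 2, int(0.9 * epochs)], gamma=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```

For the contrastive MS-COCO setting, the quote instead specifies Adam with learning rate 0.0001 decayed to 0 on a cosine schedule (e.g., `torch.optim.lr_scheduler.CosineAnnealingLR`), batch size 128, 25 epochs, and no data augmentations.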
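The Research Type row refers to empirical evidence of low-rank features. As a rough, self-contained diagnostic of what "feature rank" means in practice, the sketch below counts how many principal directions of a model's penultimate-layer features are needed to explain a chosen fraction of their variance. The function name, the 99% variance threshold, and the assumption that `feature_extractor` maps a batch of inputs to a (batch, dim) feature matrix are illustrative choices, not the paper's exact metric.

```python
import torch

@torch.no_grad()
def feature_rank(feature_extractor, loader, var_threshold=0.99, max_batches=20, device="cpu"):
    """Count the principal directions needed to explain `var_threshold` of the
    variance of the features produced by `feature_extractor` (assumed to return
    a (batch, dim) tensor of penultimate-layer features)."""
    feats = []
    for i, (x, _) in enumerate(loader):
        if i >= max_batches:
            break
        feats.append(feature_extractor(x.to(device)).cpu())
    f = torch.cat(feats)                      # (n, dim) feature matrix
    f = f - f.mean(dim=0, keepdim=True)       # center the features before the SVD
    s = torch.linalg.svdvals(f)               # singular values, descending
    var = s ** 2 / (s ** 2).sum()             # fraction of variance per direction
    return int((torch.cumsum(var, dim=0) < var_threshold).sum().item()) + 1
```

Comparing this count for a SAM-trained model against an SGD or Adam baseline on the same inputs is one way to probe the paper's qualitative claim; the absolute numbers depend on the threshold and on which layer is inspected.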