SGD with Large Step Sizes Learns Sparse Features
Authors: Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, Nicolas Flammarion
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that biases it implicitly toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features. (One possible way to quantify such feature sparsity is sketched below the table.) |
| Researcher Affiliation | Academia | Maksym Andriushchenko¹, Aditya Varre¹, Loucas Pillaud-Vivien¹, Nicolas Flammarion¹; ¹EPFL. Correspondence to: Maksym Andriushchenko <maksym.andriushchenko@epfl.ch>. |
| Pseudocode | No | The paper describes algorithms and models in prose and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features. |
| Open Datasets | Yes | A typical training dynamics for a ResNet-18 trained on CIFAR-10. ... train a DenseNet-100-12 on CIFAR-10, CIFAR-100, and Tiny ImageNet using SGD with batch size 256 and different step size schedules. |
| Dataset Splits | No | The paper mentions training and testing, and uses CIFAR-10/100 and Tiny ImageNet which have standard splits, but it does not explicitly provide the specific percentages or methodology used for train/validation/test splits within the text. |
| Hardware Specification | No | The paper describes experimental setups and training procedures but does not specify any hardware details such as GPU models, CPU types, or memory configurations used for the experiments. |
| Software Dependencies | No | The paper mentions using SGD and various neural network architectures, but it does not specify any software versions (e.g., Python version, PyTorch/TensorFlow version, CUDA version) for reproducibility. |
| Experiment Setup | Yes | We use weight decay but no momentum or data augmentation for this experiment. ... We train a ResNet-18 on CIFAR-10, CIFAR-100, and Tiny ImageNet using SGD with batch size 256 and different step size schedules. We use an exponentially increasing warmup schedule with exponent 1.05 to stabilize the training loss. We use no explicit regularization (in particular, no weight decay) in our experiments so that the training dynamics is driven purely by SGD and the step size schedule. (A hedged sketch of this setup appears below the table.) |
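
The Experiment Setup row describes the configuration only in prose. Below is a minimal sketch of how that configuration could look, assuming PyTorch and torchvision (frameworks not named in the quoted text). The ResNet-18 architecture, CIFAR-10 data, batch size 256, zero momentum, zero weight decay, no data augmentation, and the exponent-1.05 exponential warmup come from the quote; the initial and target step sizes, the per-epoch warmup update, and the epoch budget are assumptions made purely for illustration.

```python
# Minimal sketch of the quoted setup, assuming PyTorch/torchvision.
# From the quote: ResNet-18, CIFAR-10, SGD, batch size 256, no momentum,
# no weight decay, no data augmentation, exponential warmup with exponent 1.05.
# Assumptions: initial/target step sizes, per-epoch warmup update, epoch budget.
import torch
import torchvision
import torchvision.transforms as T

train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=T.ToTensor(),          # no data augmentation
)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)  # standard stem; the paper's exact CIFAR variant is not quoted
criterion = torch.nn.CrossEntropyLoss()

warmup_lr, target_lr = 0.01, 1.0     # assumed: warmup starts small and ends at a large step size
optimizer = torch.optim.SGD(model.parameters(), lr=warmup_lr,
                            momentum=0.0, weight_decay=0.0)

lr = warmup_lr
for epoch in range(100):             # assumed epoch budget
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    # Exponentially increasing warmup with exponent 1.05, capped at the large target step size.
    lr = min(lr * 1.05, target_lr)
    for group in optimizer.param_groups:
        group["lr"] = lr
```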
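
The Research Type row quotes the paper's central claim that long phases of training with large step sizes bias SGD toward sparse representations. The paper's exact sparsity measurement is not part of the quoted text; the snippet below is only an assumed, illustrative metric: the fraction of near-zero coordinates in the penultimate-layer features of a batch.

```python
# Illustrative (assumed) sparsity metric: fraction of penultimate-layer
# activations whose magnitude falls below a small threshold. This is not
# necessarily the measurement used in the paper.
import torch

@torch.no_grad()
def feature_sparsity(backbone, x, threshold=1e-3):
    """`backbone` is assumed to map a batch of inputs to pre-classifier features."""
    feats = backbone(x)                                   # shape: (batch, feature_dim)
    return (feats.abs() < threshold).float().mean().item()
```

For a torchvision ResNet-18, `backbone` could be the trained model with its final fully connected layer replaced by `torch.nn.Identity()`, so that the forward pass returns the pooled 512-dimensional features.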