SGD with Large Step Sizes Learns Sparse Features

Authors: Maksym Andriushchenko, Aditya Vardhan Varre, Loucas Pillaud-Vivien, Nicolas Flammarion

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present empirical observations that commonly used large step sizes (i) may lead the iterates to jump from one side of a valley to the other causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics that biases it implicitly toward simple predictors. Furthermore, we show empirically that the longer large step sizes keep SGD high in the loss landscape valleys, the better the implicit regularization can operate and find sparse representations. The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
Researcher Affiliation | Academia | Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion (EPFL). Correspondence to: Maksym Andriushchenko <maksym.andriushchenko@epfl.ch>.
Pseudocode | No | The paper describes algorithms and models in prose and mathematical equations, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code of our experiments is available at https://github.com/tml-epfl/sgd-sparse-features.
Open Datasets | Yes | A typical training dynamics for a ResNet-18 trained on CIFAR-10. ... train a DenseNet-100-12 on CIFAR-10, CIFAR-100, and Tiny ImageNet using SGD with batch size 256 and different step size schedules.
Dataset Splits | No | The paper mentions training and testing and uses CIFAR-10/100 and Tiny ImageNet, which have standard splits, but it does not explicitly state the train/validation/test split percentages or the splitting methodology used.
Hardware Specification | No | The paper describes experimental setups and training procedures but does not specify any hardware details such as GPU models, CPU types, or memory configurations used for the experiments.
Software Dependencies | No | The paper mentions using SGD and various neural network architectures, but it does not specify any software versions (e.g., Python version, PyTorch/TensorFlow version, CUDA version) for reproducibility.
Experiment Setup | Yes | We use weight decay but no momentum or data augmentation for this experiment. ... We train a ResNet-18 on CIFAR-10, CIFAR-100, and Tiny ImageNet using SGD with batch size 256 and different step size schedules. We use an exponentially increasing warmup schedule with exponent 1.05 to stabilize the training loss. We use no explicit regularization (in particular, no weight decay) in our experiments so that the training dynamics is driven purely by SGD and the step size schedule.
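To make the Experiment Setup row concrete, below is a minimal sketch in PyTorch of the deep-network training configuration. It is an assumption-laden illustration, not the authors' released code (see the repository linked above): the torchvision ResNet-18, the base step size (0.5), the warmup length (200 steps), the epoch count, and the exact form of the warmup rule are placeholders. The paper only specifies SGD with batch size 256, no momentum and no explicit regularization for these runs, and an exponentially increasing warmup with exponent 1.05.

    # Minimal reproduction sketch (PyTorch), not the authors' released script:
    # ResNet-18 on CIFAR-10, SGD with batch size 256, no momentum, no weight
    # decay, no data augmentation, and an exponentially increasing warmup with
    # exponent 1.05. Base step size, warmup length, and epoch count are
    # illustrative placeholders.
    import torch
    import torchvision
    import torchvision.transforms as T

    device = "cuda" if torch.cuda.is_available() else "cpu"

    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor()
    )
    train_loader = torch.utils.data.DataLoader(
        train_set, batch_size=256, shuffle=True
    )

    # The paper's setup may use a CIFAR-style ResNet-18; the torchvision
    # variant is used here for brevity.
    model = torchvision.models.resnet18(num_classes=10).to(device)
    criterion = torch.nn.CrossEntropyLoss()

    base_lr = 0.5        # assumed "large" base step size
    warmup_steps = 200   # assumed warmup length (paper only gives the exponent)
    opt = torch.optim.SGD(model.parameters(), lr=base_lr,
                          momentum=0.0, weight_decay=0.0)

    def warmup_lr(step):
        # One reading of "exponentially increasing warmup with exponent 1.05":
        # the step size grows by a factor of 1.05 per step until it reaches
        # base_lr, then stays at base_lr (the large-step-size phase).
        return min(base_lr * 1.05 ** (step - warmup_steps), base_lr)

    step = 0
    for epoch in range(10):  # epoch count is illustrative
        for x, y in train_loader:
            for g in opt.param_groups:
                g["lr"] = warmup_lr(step)
            opt.zero_grad()
            loss = criterion(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
            step += 1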