Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning

Authors: Antonio Sclocchi, Mario Geiger, Matthieu Wyart

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Here we study how the magnitude of this noise T affects performance as the size of the training set P and the scale of initialization α are varied. For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the (α, T) plane. They show that SGD noise can be detrimental or instead useful depending on the training regime."
Researcher Affiliation | Academia | "Antonio Sclocchi (1), Mario Geiger (2), Matthieu Wyart (1). (1) Institute of Physics, École Polytechnique Fédérale de Lausanne, Lausanne, 1015, Switzerland. (2) Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA."
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper.
Open Source Code | Yes | "The code with all the details of the experiments is provided at https://tinyurl.com/mrys4uyp."
Open Datasets | Yes | "For classification of MNIST and CIFAR10 images, our central results are: (i) obtaining phase diagrams for performance in the (α, T) plane. They show that SGD noise can be detrimental or instead useful depending on the training regime. ... We consider the binary datasets MNIST (even vs odd numbers) and CIFAR10 (animals vs the rest)." (See the label-construction sketch after the table.)
Dataset Splits | No | The paper uses the standard MNIST and CIFAR10 datasets but does not provide explicit training, validation, or test splits (e.g., percentages, sample counts, or references to standard splits with citations).
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU/CPU models, memory, or cloud instances) used to run the experiments.
Software Dependencies | No | The paper mentions ReLU activation functions, and PyTorch is implied by context, but it does not specify any software dependencies with version numbers (e.g., Python, PyTorch, or TensorFlow versions).
Experiment Setup | Yes | "The learning rate η is kept constant during training. The end of training is reached when L(w_t) = 0. The batch size B is taken small enough to be in the noise-dominated regime (Smith et al., 2020; Zhang et al., 2019), where the dynamics depends on the SGD temperature T = η/B. ... Below we use a 5-hidden-layers fully-connected (FC) network and a 9-hidden-layers convolutional neural network (CNN) (MNAS architecture (Tan et al., 2019)). ... All the networks use ReLU as activation functions. ... To control between feature and lazy training, we multiply the model output by α (Chizat et al., 2019). For the hinge loss, this is equivalent to changing the loss margin to 1/α." (See the training-protocol sketch below.)
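
As a companion to the Open Datasets row, here is a minimal sketch, not the authors' released code, of how the two binary tasks described in the paper (MNIST even vs odd, CIFAR10 animals vs the rest) could be built. The torchvision loading path and the ±1 label convention for the hinge loss are assumptions.

```python
# Minimal sketch (assumptions: torchvision datasets, ±1 labels for the hinge loss).
import torch
from torchvision import datasets, transforms

def binary_mnist(root="./data", train=True):
    """MNIST even-vs-odd: even digits -> +1, odd digits -> -1 (assumed convention)."""
    ds = datasets.MNIST(root, train=train, download=True,
                        transform=transforms.ToTensor())
    ds.targets = torch.where(ds.targets % 2 == 0,
                             torch.tensor(1), torch.tensor(-1))
    return ds

def binary_cifar10(root="./data", train=True):
    """CIFAR10 animals-vs-rest: bird, cat, deer, dog, frog, horse -> +1, rest -> -1."""
    ds = datasets.CIFAR10(root, train=train, download=True,
                          transform=transforms.ToTensor())
    animals = {2, 3, 4, 5, 6, 7}  # torchvision class indices of the six animal classes
    ds.targets = [1 if t in animals else -1 for t in ds.targets]
    return ds
```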
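For the Experiment Setup row, the following is a hedged sketch of the quoted training protocol, assuming PyTorch: the model output is rescaled by α (Chizat et al., 2019), a hinge loss with unit margin is applied to the rescaled output (equivalent, up to an overall factor, to margin 1/α on the unscaled output), the learning rate η is kept constant, and the SGD temperature is T = η/B. The function name, default hyperparameter values, and the `net`/`loader` interfaces are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of the training protocol (assumptions: PyTorch, ±1 labels,
# a scalar-output network `net`, and a DataLoader `loader` with batch size B).
import torch

def train_to_zero_loss(net, loader, alpha=1.0, eta=0.1, max_epochs=10_000,
                       device="cpu"):
    B = loader.batch_size
    T = eta / B                                       # SGD temperature T = eta / B
    opt = torch.optim.SGD(net.parameters(), lr=eta)   # constant learning rate
    net.to(device)
    for epoch in range(max_epochs):
        total_loss = 0.0
        for x, y in loader:
            x, y = x.to(device), y.to(device).float()
            out = alpha * net(x).squeeze(-1)                     # output rescaled by alpha
            loss = torch.clamp(1.0 - y * out, min=0.0).mean()    # hinge loss, margin 1
            opt.zero_grad()
            loss.backward()
            opt.step()
            total_loss += loss.item()
        if total_loss == 0.0:   # stopping criterion quoted above: L(w_t) = 0
            break
    return net, T
```

In this parametrization, larger α pushes training toward the lazy regime, while the noise magnitude is controlled by T = η/B; these are the two axes of the (α, T) phase diagrams the paper reports.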