Train faster, generalize better: Stability of stochastic gradient descent

Authors: Moritz Hardt, Ben Recht, Yoram Singer

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable... Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization... Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit. ... The goal of our experiments is to isolate the effect of training time, measured in number of steps, on the stability of SGM. We evaluated broadly a variety of neural network architectures and varying step sizes on a number of different datasets. (A minimal sketch of this stability measurement appears after the table.)
Researcher Affiliation | Collaboration | Moritz Hardt (MRTZ@GOOGLE.COM), Benjamin Recht (BRECHT@BERKELEY.EDU), Yoram Singer (SINGER@GOOGLE.COM)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain a statement providing concrete access to the source code for the methodology described. It mentions a third-party tool's code: 'available in the cuda-convnet code', with a footnote pointing to https://code.google.com/archive/p/cuda-convnet.
Open Datasets | Yes | We analyzed four standard machine learning datasets each with their own corresponding deep architecture. We studied the LeNet architecture for MNIST, the cuda-convnet architecture for CIFAR-10, the AlexNet model for ImageNet, and the LSTM model for the Penn Treebank Language Model (PTB). ... Penn Treebank (PTB) (Marcus et al., 1993)
Dataset Splits | Yes | We focused on word-level prediction experiments using the Penn Treebank (PTB) (Marcus et al., 1993), consisting of 929,000 training words, 73,000 validation words, and 82,000 test words.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It discusses training models and architectures but not the underlying hardware.
Software Dependencies | No | The paper mentions the 'cuda-convnet code' and other techniques but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The learning rate was fixed at 0.01. We trained with minibatch size 60. On ImageNet, we trained the standard AlexNet architecture (Krizhevsky et al., 2012) using data augmentation, regularization, and dropout. We trained with minibatch size 20. The LSTM has 200 units per layer and its parameters are initialized to have mean zero and standard deviation of 0.1. (A hedged sketch of this setup follows the table.)
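The Research Type row above quotes the paper's experimental goal: isolating the effect of training time, measured in number of steps, on the stability of SGM. The following is a minimal sketch of that measurement idea on a toy logistic-regression problem rather than the paper's architectures: two SGM runs share the same random seed but train on datasets that differ in a single example, and the Euclidean distance between their parameter iterates is tracked as the step count grows. The model, the data, and all names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sgm_path(X, y, lr=0.01, batch_size=60, steps=2000, seed=0):
    """Run SGM on a logistic-regression loss and record every parameter iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    path = []
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a minibatch
        Xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-Xb @ w))                    # sigmoid predictions
        grad = Xb.T @ (p - yb) / batch_size                  # logistic-loss gradient
        w = w - lr * grad                                    # SGM update, fixed step size
        path.append(w.copy())
    return np.array(path)

# Build two training sets that differ in exactly one example.
rng = np.random.default_rng(1)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(float)
X_prime, y_prime = X.copy(), y.copy()
X_prime[0] = rng.normal(size=d)                              # replace a single example
y_prime[0] = 1.0 - y_prime[0]

# With a shared seed, the only source of divergence is the differing example;
# the parameter distance as a function of the number of steps is the stability proxy.
path_a = sgm_path(X, y)
path_b = sgm_path(X_prime, y_prime)
divergence = np.linalg.norm(path_a - path_b, axis=1)
print(divergence[::500])
```

Fewer steps keep the two trajectories closer together, which is the stability-implies-generalization intuition the paper formalizes.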
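For the Experiment Setup row, the quoted hyperparameters map roughly onto the following PyTorch-style configuration. Only the 0.01 learning rate, the minibatch sizes of 60 and 20, the 200-unit LSTM layers, and the mean-zero, standard-deviation-0.1 initialization come from the quoted text; the placeholder linear model and the LSTM layer count are assumptions for illustration, and the paper's own experiments used cuda-convnet rather than PyTorch.

```python
import torch
import torch.nn as nn

# Vision experiments (MNIST / CIFAR-10): fixed learning rate and minibatch size as quoted.
vision_model = nn.Linear(784, 10)  # stand-in for the LeNet / cuda-convnet models
vision_optimizer = torch.optim.SGD(vision_model.parameters(), lr=0.01)  # "learning rate was fixed at 0.01"
vision_batch_size = 60             # "minibatch size 60"

# PTB language model: 200 units per layer, minibatch size 20, and parameters
# initialized with mean zero and standard deviation 0.1.
ptb_lstm = nn.LSTM(input_size=200, hidden_size=200, num_layers=2)  # layer count is an assumption
for param in ptb_lstm.parameters():
    nn.init.normal_(param, mean=0.0, std=0.1)
ptb_batch_size = 20
```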