Train faster, generalize better: Stability of stochastic gradient descent

Authors: Moritz Hardt, Ben Recht, Yoram Singer

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable... Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization... Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit. ... The goal of our experiments is to isolate the effect of training time, measured in number of steps, on the stability of SGM. We evaluated broadly a variety of neural network architectures and varying step sizes on a number of different datasets. (A minimal sketch of this stability measurement appears after the table.)
Researcher Affiliation | Collaboration | Moritz Hardt (MRTZ@GOOGLE.COM), Benjamin Recht (BRECHT@BERKELEY.EDU), Yoram Singer (SINGER@GOOGLE.COM)
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain a statement providing concrete access to the source code for the methodology described. It mentions a third-party tool's code: 'available in the cuda-convnet code', with a footnote pointing to https://code.google.com/archive/p/cuda-convnet.
Open Datasets | Yes | We analyzed four standard machine learning datasets each with their own corresponding deep architecture. We studied the LeNet architecture for MNIST, the cuda-convnet architecture for CIFAR-10, the AlexNet model for ImageNet, and the LSTM model for the Penn Treebank Language Model (PTB). ... Penn Treebank (PTB) (Marcus et al., 1993)
Dataset Splits | Yes | We focused on word-level prediction experiments using the Penn Treebank (PTB) (Marcus et al., 1993), consisting of 929,000 training words, 73,000 validation words, and 82,000 test words.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments. It discusses training models and architectures but not the underlying hardware.
Software Dependencies | No | The paper mentions the 'cuda-convnet code' and other techniques but does not provide specific software dependencies with version numbers.
Experiment Setup | Yes | The learning rate was fixed at 0.01. We trained with minibatch size 60. On ImageNet, we trained the standard AlexNet architecture (Krizhevsky et al., 2012) using data augmentation, regularization, and dropout. We trained with minibatch size 20. The LSTM has 200 units per layer and its parameters are initialized to have mean zero and standard deviation of 0.1. (A hedged sketch of this setup follows the table.)
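The Research Type row above quotes the paper's experimental goal: isolating the effect of training time, measured in number of steps, on the stability of SGM. The following is a minimal sketch of that measurement idea on a toy logistic-regression problem rather than the paper's architectures: two SGM runs share the same random seed but train on datasets that differ in a single example, and the Euclidean distance between their parameter iterates is tracked as the step count grows. The model, the data, and all names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sgm_path(X, y, lr=0.01, batch_size=60, steps=2000, seed=0):
    """Run SGM on a logistic-regression loss and record every parameter iterate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    path = []
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)  # sample a minibatch
        Xb, yb = X[idx], y[idx]
        p = 1.0 / (1.0 + np.exp(-Xb @ w))                    # sigmoid predictions
        grad = Xb.T @ (p - yb) / batch_size                  # logistic-loss gradient
        w = w - lr * grad                                    # SGM update, fixed step size
        path.append(w.copy())
    return np.array(path)

# Build two training sets that differ in exactly one example.
rng = np.random.default_rng(1)
n, d = 1000, 20
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(float)
X_prime, y_prime = X.copy(), y.copy()
X_prime[0] = rng.normal(size=d)                              # replace a single example
y_prime[0] = 1.0 - y_prime[0]

# With a shared seed, the only source of divergence is the differing example;
# the parameter distance as a function of the number of steps is the stability proxy.
path_a = sgm_path(X, y)
path_b = sgm_path(X_prime, y_prime)
divergence = np.linalg.norm(path_a - path_b, axis=1)
print(divergence[::500])
```

Fewer steps keep the two trajectories closer together, which is the stability-implies-generalization intuition the paper formalizes.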
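For the Experiment Setup row, the quoted hyperparameters map roughly onto the following PyTorch-style configuration. Only the 0.01 learning rate, the minibatch sizes of 60 and 20, the 200-unit LSTM layers, and the mean-zero, standard-deviation-0.1 initialization come from the quoted text; the placeholder linear model and the LSTM layer count are assumptions for illustration, and the paper's own experiments used cuda-convnet rather than PyTorch.

```python
import torch
import torch.nn as nn

# Vision experiments (MNIST / CIFAR-10): fixed learning rate and minibatch size as quoted.
vision_model = nn.Linear(784, 10)  # stand-in for the LeNet / cuda-convnet models
vision_optimizer = torch.optim.SGD(vision_model.parameters(), lr=0.01)  # "learning rate was fixed at 0.01"
vision_batch_size = 60             # "minibatch size 60"

# PTB language model: 200 units per layer, minibatch size 20, and parameters
# initialized with mean zero and standard deviation 0.1.
ptb_lstm = nn.LSTM(input_size=200, hidden_size=200, num_layers=2)  # layer count is an assumption
for param in ptb_lstm.parameters():
    nn.init.normal_(param, mean=0.0, std=0.1)
ptb_batch_size = 20
```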