Stochastic Training is Not Necessary for Generalization
Authors: Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization. Our observations indicate that the perceived difficulty of full-batch training may be the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training. We show that a standard ResNet-18 can be trained with batch size 50K (the entire training dataset) and still achieve 95.67% (±0.08) validation accuracy on CIFAR-10, which is comparable to the same network trained with a strong SGD baseline, provided data augmentation is used for both methods. (A hedged sketch of one such explicit regularizer appears after the table.) |
| Researcher Affiliation | Academia | Jonas Geiping, University of Siegen, jgeiping@umd.edu; Micah Goldblum, University of Maryland, goldblum@umd.edu; Phillip E. Pope, University of Maryland, pepope@cs.umd.edu; Michael Moeller, University of Siegen, michael.moeller@uni-siegen.de; Tom Goldstein, University of Maryland, tomg@umd.edu |
| Pseudocode | No | The paper describes methods in prose but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our open source implementation can be found at https://github.com/JonasGeiping/fullbatchtraining and contains the exact implementation with which these results were computed and we further include all necessary scaffolding we used to run distributed experiments on arbitrarily many GPU nodes, as well as model checkpointing to run experiments on only a single machine with optional GPU. |
| Open Datasets | Yes | We use only CIFAR-10 data (Krizhevsky, 2009) in our experiments, for more information refer to https://www.cs.toronto.edu/~kriz/cifar.html. |
| Dataset Splits | Yes | We show that a standard ResNet-18 can be trained with batch size 50K (the entire training dataset) and still achieve 95.67% (±0.08) validation accuracy on CIFAR-10, which is comparable to the same network trained with a strong SGD baseline, provided data augmentation is used for both methods. Table 1: Validation accuracies on the CIFAR-10 validation set for each experiment with data augmentations considered in Section 3. |
| Hardware Specification | Yes | All experiments are run on an internal SLURM cluster of 4 × 4 + 8 NVIDIA Tesla V100-PCIE-16GB GPUs. |
| Software Dependencies | Yes | This experimental setup is implemented in PyTorch (Paszke et al., 2017), version 1.9. |
| Experiment Setup | Yes | For the SGD baseline, we train with SGD and a batch size of 128, Nesterov momentum of 0.9, and weight decay of 0.0005. Mini-batches are drawn randomly without replacement in every epoch. The learning rate is warmed up from 0.0 to 0.1 over the first 5 epochs and then reduced via cosine annealing to 0 over the course of training (Loshchilov & Hutter, 2017). The model is trained for 300 epochs. In the full-batch setting, the initial learning rate of 0.4 is not particularly large compared to the small-batch regime, and it is extremely small by the standards of a linear scaling rule (Goyal et al., 2018), which would suggest a learning rate of 39, or even a square-root scaling rule (Hoffer et al., 2017), which would predict a learning rate of 1.975 when training longer. As the size of the full dataset is certainly larger than any critical batch size, we would not expect to succeed in fewer steps than SGD. Yet the number of steps, 3000, is simultaneously huge when measured in passes through the dataset and tiny when measured in parameter update steps. We clip the gradient over the entire dataset to have an ℓ2 norm of at most 0.25 before updating parameters. |
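As a reading aid for the Experiment Setup row, here is a minimal PyTorch sketch of the quoted SGD baseline: batch size 128, Nesterov momentum 0.9, weight decay 0.0005, a linear warmup from 0.0 to 0.1 over the first 5 epochs, and cosine annealing to 0 over 300 epochs. The `model` stand-in, the per-step (rather than per-epoch) scheduling, and the helper names are assumptions for illustration; the authors' repository linked above contains the exact implementation.

```python
import math

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Stand-in module; the paper trains a ResNet-18 on CIFAR-10.
model = torch.nn.Linear(10, 10)
steps_per_epoch = 50_000 // 128          # CIFAR-10 training set, batch size 128
epochs, warmup_epochs, base_lr = 300, 5, 0.1

optimizer = SGD(model.parameters(), lr=base_lr,
                momentum=0.9, nesterov=True, weight_decay=5e-4)

def lr_factor(step: int) -> float:
    """Multiplier on base_lr: linear warmup 0.0 -> 1.0, then cosine decay to 0."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```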
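The same row states that, in the full-batch setting, the gradient over the entire dataset is clipped to an ℓ2 norm of at most 0.25 before each of the 3000 parameter updates. Below is a hedged sketch of one such full-batch step; accumulating the gradient over memory-sized chunks of the training set is an implementation assumption (the linked repository contains the authors' actual distributed version), and `loss_fn`, `loader`, and `optimizer` are placeholders.

```python
import torch

def full_batch_step(model, loss_fn, loader, optimizer, clip_norm=0.25):
    """One parameter update using the gradient over the entire training set.

    `loader` is assumed to iterate over the full dataset in chunks that fit
    in memory; per-chunk gradients are accumulated and averaged, which equals
    the full-dataset mean gradient when all chunks have the same size.
    """
    optimizer.zero_grad()
    num_chunks = len(loader)
    for inputs, targets in loader:
        loss = loss_fn(model(inputs), targets) / num_chunks
        loss.backward()  # gradients accumulate across chunks
    # Clip the accumulated full-dataset gradient to an l2 norm of at most 0.25.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
```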
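Finally, the Research Type row quotes the claim that the implicit regularization of SGD can be replaced with explicit regularization. A common way to make that effect explicit is to penalize the norm of the loss gradient; the sketch below shows one generic form of such a penalty. The coefficient `alpha` and the exact penalty form are illustrative assumptions, not the paper's specification, which is documented in the open-source code.

```python
import torch

def loss_with_grad_penalty(model, loss_fn, inputs, targets, alpha=0.1):
    """Generic explicit regularizer: loss + alpha * ||grad_theta loss||^2.

    `alpha` is illustrative; create_graph=True keeps the graph so the penalty
    itself can be backpropagated (at the cost of a double backward pass).
    """
    loss = loss_fn(model(inputs), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    penalty = sum((g ** 2).sum() for g in grads)
    return loss + alpha * penalty

# Usage: total = loss_with_grad_penalty(model, loss_fn, x, y); total.backward()
```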