Stochastic Training is Not Necessary for Generalization
Authors: Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization. Our observations indicate that the perceived difficulty of full-batch training may be the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training. We show that a standard ResNet-18 can be trained with batch size 50K (the entire training dataset) and still achieve 95.67% (±0.08) validation accuracy on CIFAR-10, which is comparable to the same network trained with a strong SGD baseline, provided data augmentation is used for both methods. (A hedged sketch of one such explicit regularizer appears after the table.) |
| Researcher Affiliation | Academia | Jonas Geiping, University of Siegen, jgeiping@umd.edu; Micah Goldblum, University of Maryland, goldblum@umd.edu; Phillip E. Pope, University of Maryland, pepope@cs.umd.edu; Michael Moeller, University of Siegen, michael.moeller@uni-siegen.de; Tom Goldstein, University of Maryland, tomg@umd.edu |
| Pseudocode | No | The paper describes methods in prose but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our open source implementation can be found at https://github.com/JonasGeiping/fullbatchtraining and contains the exact implementation with which these results were computed and we further include all necessary scaffolding we used to run distributed experiments on arbitrarily many GPU nodes, as well as model checkpointing to run experiments on only a single machine with optional GPU. |
| Open Datasets | Yes | We use only CIFAR-10 data (Krizhevsky, 2009) in our experiments, for more information refer to https://www.cs.toronto.edu/~kriz/cifar.html. |
| Dataset Splits | Yes | We show that a standard ResNet-18 can be trained with batch size 50K (the entire training dataset) and still achieve 95.67% (±0.08) validation accuracy on CIFAR-10, which is comparable to the same network trained with a strong SGD baseline, provided data augmentation is used for both methods. Table 1: Validation accuracies on the CIFAR-10 validation set for each experiment with data augmentations considered in Section 3. |
| Hardware Specification | Yes | All experiments are run on an internal SLURM cluster of 4 × 4 + 8 NVIDIA Tesla V100-PCIE-16GB GPUs. |
| Software Dependencies | Yes | This experimental setup is implemented in PyTorch (Paszke et al., 2017), version 1.9. |
| Experiment Setup | Yes | For the SGD baseline, we train with SGD and a batch size of 128, Nesterov momentum of 0.9, and weight decay of 0.0005. Mini-batches are drawn randomly without replacement in every epoch. The learning rate is warmed up from 0.0 to 0.1 over the first 5 epochs and then reduced via cosine annealing to 0 over the course of training (Loshchilov & Hutter, 2017). The model is trained for 300 epochs. In the full-batch setting, the initial learning rate of 0.4 is not particularly large compared to the small-batch regime, and it is extremely small by the standards of a linear scaling rule (Goyal et al., 2018), which would suggest a learning rate of 39, or even a square-root scaling rule (Hoffer et al., 2017), which would predict a learning rate of 1.975 when training longer. As the size of the full dataset is certainly larger than any critical batch size, we would not expect to succeed in fewer steps than SGD. Yet the number of steps, 3000, is simultaneously huge when measured in passes through the dataset and tiny when measured in parameter update steps. We clip the gradient over the entire dataset to have an ℓ2 norm of at most 0.25 before updating parameters. |
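As a reading aid for the Experiment Setup row, here is a minimal PyTorch sketch of the quoted SGD baseline: batch size 128, Nesterov momentum 0.9, weight decay 0.0005, a linear warmup from 0.0 to 0.1 over the first 5 epochs, and cosine annealing to 0 over 300 epochs. The `model` stand-in, the per-step (rather than per-epoch) scheduling, and the helper names are assumptions for illustration; the authors' repository linked above contains the exact implementation.

```python
import math

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Stand-in module; the paper trains a ResNet-18 on CIFAR-10.
model = torch.nn.Linear(10, 10)
steps_per_epoch = 50_000 // 128          # CIFAR-10 training set, batch size 128
epochs, warmup_epochs, base_lr = 300, 5, 0.1

optimizer = SGD(model.parameters(), lr=base_lr,
                momentum=0.9, nesterov=True, weight_decay=5e-4)

def lr_factor(step: int) -> float:
    """Multiplier on base_lr: linear warmup 0.0 -> 1.0, then cosine decay to 0."""
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda=lr_factor)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```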
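The same row states that, in the full-batch setting, the gradient over the entire dataset is clipped to an ℓ2 norm of at most 0.25 before each of the 3000 parameter updates. Below is a hedged sketch of one such full-batch step; accumulating the gradient over memory-sized chunks of the training set is an implementation assumption (the linked repository contains the authors' actual distributed version), and `loss_fn`, `loader`, and `optimizer` are placeholders.

```python
import torch

def full_batch_step(model, loss_fn, loader, optimizer, clip_norm=0.25):
    """One parameter update using the gradient over the entire training set.

    `loader` is assumed to iterate over the full dataset in chunks that fit
    in memory; per-chunk gradients are accumulated and averaged, which equals
    the full-dataset mean gradient when all chunks have the same size.
    """
    optimizer.zero_grad()
    num_chunks = len(loader)
    for inputs, targets in loader:
        loss = loss_fn(model(inputs), targets) / num_chunks
        loss.backward()  # gradients accumulate across chunks
    # Clip the accumulated full-dataset gradient to an l2 norm of at most 0.25.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip_norm)
    optimizer.step()
```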
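Finally, the Research Type row quotes the claim that the implicit regularization of SGD can be replaced with explicit regularization. A common way to make that effect explicit is to penalize the norm of the loss gradient; the sketch below shows one generic form of such a penalty. The coefficient `alpha` and the exact penalty form are illustrative assumptions, not the paper's specification, which is documented in the open-source code.

```python
import torch

def loss_with_grad_penalty(model, loss_fn, inputs, targets, alpha=0.1):
    """Generic explicit regularizer: loss + alpha * ||grad_theta loss||^2.

    `alpha` is illustrative; create_graph=True keeps the graph so the penalty
    itself can be backpropagated (at the cost of a double backward pass).
    """
    loss = loss_fn(model(inputs), targets)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    penalty = sum((g ** 2).sum() for g in grads)
    return loss + alpha * penalty

# Usage: total = loss_with_grad_penalty(model, loss_fn, x, y); total.backward()
```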