Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well

Authors: Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet.
Researcher Affiliation | Collaboration | Vipul Gupta (vipul gupta@berkeley.edu), Department of EECS, UC Berkeley; Santiago Akle Serrano (sakle@apple.com), Apple Inc.; Dennis DeCoste (ddecoste@apple.com), Apple Inc.
Pseudocode | Yes | Algorithm 1: Stochastic Weight Averaging in Parallel (SWAP). A hedged sketch of the procedure appears below the table.
Open Source Code | No | The paper mentions using 'publicly available models' and provides links to DAWNBench and a specific ImageNet model, but it does not state that the authors' implementation of SWAP is open source or provide a link to their own code.
Open Datasets | Yes | For several image classification tasks on popular computer vision datasets (CIFAR10, CIFAR100, and ImageNet), we show that SWAP achieves generalization performance comparable to models trained with small-batches but does so in time similar to that of a training run with large-batches.
Dataset Splits | No | The paper provides hyperparameter tables that include 'Stopping Accuracy (%)', which implies a validation set used for early stopping, but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, and testing.
Hardware Specification | Yes | All experiments were run on one machine with 8 NVIDIA Tesla V100 GPUs and use Horovod (Sergeev & Del Balso, 2018) to distribute the computation. Our small-batch experiments train ImageNet for 28 epochs using the published schedules with no modification and are run on 8 Tesla V100 GPUs. Our large-batch experiments modify the schedules by doubling the batch size and doubling the learning rates (see Figure 5) and are run on 16 Tesla V100 GPUs. A hedged sketch of a typical Horovod setup appears below the table.
Software Dependencies | No | The paper mentions 'mini-batch SGD with Nesterov momentum' and 'Horovod', but it does not provide specific version numbers for these or for other general software dependencies (e.g., Python, PyTorch, TensorFlow).
Experiment Setup | Yes | For the experiments in this subsection, we found the best hyper-parameters using grid searches (see Appendix A for details). Appendix A contains Table 5 ('Hyperparameters obtained using tuning for CIFAR10') and Table 6 ('Hyperparameters obtained using tuning for CIFAR100'), which list 'Batch-size', 'Learning-rate Peak', 'Maximum Epochs', 'Warm-up Epochs', 'GPUs used per model', and 'Stopping Accuracy (%)'.
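
For context on the 'Pseudocode' row: SWAP as described in the paper proceeds in three phases, synchronized large-batch training, independent small-batch refinement on each worker, and a final averaging of the refined weights. The sketch below is a minimal single-process PyTorch illustration of that structure, assuming hypothetical helper names (sgd_epochs, swap), placeholder loaders, and fixed epoch counts; the authors' Algorithm 1 instead distributes phase 1 across workers and switches phases when a target stopping accuracy is reached.

```python
# Minimal sketch of the three-phase SWAP structure; not the authors' code.
import copy
import torch
import torch.nn.functional as F


def sgd_epochs(model, loader, lr, epochs):
    # Plain SGD with Nesterov momentum, matching the optimizer the paper reports.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            opt.zero_grad()
            loss = F.cross_entropy(model(inputs), targets)
            loss.backward()
            opt.step()


def swap(model, large_batch_loader, small_batch_loaders,
         lr_large, lr_small, epochs_large, epochs_small):
    # Phase 1: large-batch training. In the paper this phase is data-parallel
    # SGD across all workers and ends at a target stopping accuracy; a single
    # large-batch loader and a fixed epoch count stand in for that here.
    sgd_epochs(model, large_batch_loader, lr_large, epochs_large)

    # Phase 2: each worker independently refines its own copy of the model
    # with small batches and a different data ordering.
    refined = []
    for loader in small_batch_loaders:
        worker_model = copy.deepcopy(model)
        sgd_epochs(worker_model, loader, lr_small, epochs_small)
        refined.append(worker_model)

    # Phase 3: average the refined weights to obtain the final model.
    # (In practice, batch-norm statistics would typically be recomputed
    # after averaging, as in stochastic weight averaging.)
    avg_state = {
        key: torch.stack([m.state_dict()[key].float() for m in refined]).mean(dim=0)
        for key in refined[0].state_dict()
    }
    model.load_state_dict(avg_state)
    return model
```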
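
The 'Hardware Specification' row notes that Horovod is used to distribute computation across Tesla V100 GPUs and that the large-batch runs double both batch size and learning rate. Below is a minimal sketch of a typical Horovod data-parallel setup in PyTorch, assuming one GPU per worker, a placeholder model, and an illustrative linear learning-rate scaling; it is not the authors' configuration.

```python
# Typical Horovod data-parallel setup in PyTorch; illustrative only.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # assumes one GPU per worker

model = torch.nn.Linear(32, 10).cuda()   # placeholder model
# Illustrative linear scaling of the learning rate with the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size(),
                            momentum=0.9, nesterov=True)

# Wrap the optimizer so gradients are averaged across workers at each step,
# and make every worker start from identical weights and optimizer state.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```

With this wrapping, each worker computes gradients on its own shard of the data and Horovod averages them before the optimizer step, which is the standard way data-parallel SGD of the kind reported in the paper is run.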