Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
Authors: Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet. |
| Researcher Affiliation | Collaboration | Vipul Gupta vipul gupta@berkeley.edu Department of EECS, UC Berkeley; Santiago Akle Serrano sakle@apple.com Apple Inc.; Dennis DeCoste ddecoste@apple.com Apple Inc. |
| Pseudocode | Yes | Algorithm 1: Stochastic Weight Averaging in Parallel (SWAP). A hedged sketch of this procedure is given after the table. |
| Open Source Code | No | The paper mentions using 'publicly available models' and provides links to DAWNBench and a specific ImageNet model, but does not state that the authors' implementation of SWAP itself is open-source or provide a link to their own code. |
| Open Datasets | Yes | For several image classification tasks on popular computer vision datasets (CIFAR10, CIFAR100, and ImageNet), we show that SWAP achieves generalization performance comparable to models trained with small-batches but does so in time similar to that of a training run with large-batches. |
| Dataset Splits | No | The paper provides hyperparameter tables that include 'Stopping Accuracy (%)', which suggests a validation-based stopping criterion, but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, and testing. |
| Hardware Specification | Yes | All experiments were run on one machine with 8 NVIDIA Tesla V100 GPUs and use Horovod (Sergeev & Del Balso, 2018) to distribute the computation. Our small-batch experiments train ImageNet for 28 epochs using the published schedules with no modification and are run on 8 Tesla V100 GPUs. Our large-batch experiments modify the schedules by doubling the batch size and doubling the learning rates (see Figure 5) and are run on 16 Tesla V100 GPUs. A hedged Horovod setup sketch is given after the table. |
| Software Dependencies | No | The paper mentions 'mini-batch SGD with Nesterov momentum' and 'Horovod', but does not provide specific version numbers for these or other general software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For the experiments in this subsection, we found the best hyper-parameters using grid searches (see Appendix A for details). Appendix A contains Table 5: Hyperparameters obtained using tuning for CIFAR10 and Table 6: Hyperparameters obtained using tuning for CIFAR100, which list 'Batch-size', 'Learning-rate Peak', 'Maximum Epochs', 'Warm-up Epochs', 'GPUs used per model', and 'Stopping Accuracy (%)'. |
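
Regarding the Pseudocode row above, the paper's Algorithm 1 (SWAP) proceeds in three phases: all workers first train a single model with large-batch SGD, each worker then independently refines a copy of that model with small-batch SGD, and the refined weights are finally averaged. The snippet below is a minimal sketch of the final averaging step only, assuming PyTorch; it is not the authors' implementation, and the stand-in models are purely illustrative.

```python
import copy

import torch
import torch.nn as nn


def average_weights(models):
    """Average the floating-point parameters/buffers of per-worker models."""
    averaged = copy.deepcopy(models[0])
    state = averaged.state_dict()
    for key, value in state.items():
        if value.is_floating_point():
            state[key] = torch.stack(
                [m.state_dict()[key].float() for m in models]
            ).mean(dim=0)
    averaged.load_state_dict(state)
    # As in standard SWA, batch-norm running statistics should be recomputed
    # with a pass over the training data after averaging.
    return averaged


if __name__ == "__main__":
    # Illustrative stand-ins for the models that each worker refined
    # independently with small-batch SGD (phase 2 of SWAP).
    refined = [nn.Linear(10, 2) for _ in range(4)]
    swap_model = average_weights(refined)
    print(swap_model.weight.shape)
```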
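
Similarly, for the Hardware Specification row, the quoted setup distributes training across V100 GPUs with Horovod. The fragment below is a hedged sketch of a typical Horovod data-parallel setup for mini-batch SGD with Nesterov momentum; the model choice, learning-rate scaling, and launch command are assumptions for illustration, not the paper's configuration.

```python
import torch
import torchvision
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Illustrative model; the paper uses published CIFAR/ImageNet models instead.
model = torchvision.models.resnet18(num_classes=10).cuda()

# Mini-batch SGD with Nesterov momentum, with a (hypothetical) linear
# learning-rate scaling by the number of workers.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1 * hvd.size(), momentum=0.9, nesterov=True
)

# Average gradients across workers and start all workers from the same state.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Launched with, e.g.:  horovodrun -np 8 python train.py
```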