Stochastic Weight Averaging in Parallel: Large-Batch Training That Generalizes Well
Authors: Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the reduction in training time and the good generalization performance of the resulting models on the computer vision datasets CIFAR10, CIFAR100, and ImageNet. |
| Researcher Affiliation | Collaboration | Vipul Gupta vipul gupta@berkeley.edu Department of EECS, UC Berkeley; Santiago Akle Serrano sakle@apple.com Apple Inc.; Dennis DeCoste ddecoste@apple.com Apple Inc. |
| Pseudocode | Yes | Algorithm 1: Stochastic Weight Averaging in Parallel (SWAP). A hedged sketch of this procedure is given after the table. |
| Open Source Code | No | The paper mentions using 'publicly available models' and provides links to DAWNBench and a specific ImageNet model, but does not state that the authors' implementation of SWAP itself is open-source or provide a link to their own code. |
| Open Datasets | Yes | For several image classification tasks on popular computer vision datasets (CIFAR10, CIFAR100, and ImageNet), we show that SWAP achieves generalization performance comparable to models trained with small-batches but does so in time similar to that of a training run with large-batches. |
| Dataset Splits | No | The paper provides hyperparameter tables that include 'Stopping Accuracy (%)', which suggests a validation-based stopping criterion, but it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, and testing. |
| Hardware Specification | Yes | All experiments were run on one machine with 8 NVIDIA Tesla V100 GPUs and use Horovod (Sergeev & Del Balso, 2018) to distribute the computation. Our small-batch experiments train ImageNet for 28 epochs using the published schedules with no modification and are run on 8 Tesla V100 GPUs. Our large-batch experiments modify the schedules by doubling the batch size and doubling the learning rates (see Figure 5) and are run on 16 Tesla V100 GPUs. A hedged Horovod setup sketch is given after the table. |
| Software Dependencies | No | The paper mentions 'mini-batch SGD with Nesterov momentum' and 'Horovod', but does not provide specific version numbers for these or other general software dependencies (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | For the experiments in this subsection, we found the best hyper-parameters using grid searches (see Appendix A for details). Appendix A contains Table 5: Hyperparameters obtained using tuning for CIFAR10 and Table 6: Hyperparameters obtained using tuning for CIFAR100, which list 'Batch-size', 'Learning-rate Peak', 'Maximum Epochs', 'Warm-up Epochs', 'GPUs used per model', and 'Stopping Accuracy (%)'. |
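
Regarding the Pseudocode row above, the paper's Algorithm 1 (SWAP) proceeds in three phases: all workers first train a single model with large-batch SGD, each worker then independently refines a copy of that model with small-batch SGD, and the refined weights are finally averaged. The snippet below is a minimal sketch of the final averaging step only, assuming PyTorch; it is not the authors' implementation, and the stand-in models are purely illustrative.

```python
import copy

import torch
import torch.nn as nn


def average_weights(models):
    """Average the floating-point parameters/buffers of per-worker models."""
    averaged = copy.deepcopy(models[0])
    state = averaged.state_dict()
    for key, value in state.items():
        if value.is_floating_point():
            state[key] = torch.stack(
                [m.state_dict()[key].float() for m in models]
            ).mean(dim=0)
    averaged.load_state_dict(state)
    # As in standard SWA, batch-norm running statistics should be recomputed
    # with a pass over the training data after averaging.
    return averaged


if __name__ == "__main__":
    # Illustrative stand-ins for the models that each worker refined
    # independently with small-batch SGD (phase 2 of SWAP).
    refined = [nn.Linear(10, 2) for _ in range(4)]
    swap_model = average_weights(refined)
    print(swap_model.weight.shape)
```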
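
Similarly, for the Hardware Specification row, the quoted setup distributes training across V100 GPUs with Horovod. The fragment below is a hedged sketch of a typical Horovod data-parallel setup for mini-batch SGD with Nesterov momentum; the model choice, learning-rate scaling, and launch command are assumptions for illustration, not the paper's configuration.

```python
import torch
import torchvision
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

# Illustrative model; the paper uses published CIFAR/ImageNet models instead.
model = torchvision.models.resnet18(num_classes=10).cuda()

# Mini-batch SGD with Nesterov momentum, with a (hypothetical) linear
# learning-rate scaling by the number of workers.
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.1 * hvd.size(), momentum=0.9, nesterov=True
)

# Average gradients across workers and start all workers from the same state.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Launched with, e.g.:  horovodrun -np 8 python train.py
```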