Don't Use Large Mini-batches, Use Local SGD

Authors: Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Key aspects of the empirical performance of local SGD compared to mini-batch baselines are illustrated in Figure 1. In scenario 1), comparing local SGD with H = 4 (A4) against mini-batch SGD of the same effective batch size B = 4·Bloc (A3) reveals a stark difference, both in terms of train and test error (local SGD achieves lower training loss and higher test accuracy). This motivates the use of local SGD as an alternative to large-batch training, a hypothesis that we confirm in our experiments. Further, in scenario 2), mini-batch SGD with the smaller batch size B = Bloc (A2) is observed to suffer from poor generalization, although its training curve matches the single-machine baseline (A1). Our main contributions can thus be summarized as follows: Trade-offs in Local SGD: We provide the first comprehensive empirical study of the trade-offs in local SGD for deep learning when varying the number of workers K, the number of local steps H, and the mini-batch sizes, for both scenarios 1) on communication efficiency and 2) on generalization. Our empirical experiments on standard benchmarks show that post-local SGD can reach flatter minima than large-batch SGD on those problems. (A minimal code sketch of the local SGD update rule is given after this table.)
Researcher Affiliation | Collaboration | Tao Lin, EPFL, Switzerland, tao.lin@epfl.ch; Sebastian U. Stich, EPFL, Switzerland, sebastian.stich@epfl.ch; Kumar Kshitij Patel, IIT Kanpur, India, kumarkshitijpatel@gmail.com; Martin Jaggi, EPFL, Switzerland, martin.jaggi@epfl.ch
Pseudocode | Yes | Algorithm 1 Local SGD ... Algorithm 2 Post-local SGD ... Algorithm 3 (Post-)Local SGD with the compression scheme in signSGD ... Algorithm 4 (Post-)Local SGD with the compression scheme in EF-signSGD ... Algorithm 5 Hierarchical Local SGD. (A sketch of the post-local SGD schedule follows this table.)
Open Source Code | Yes | Our algorithms are implemented in PyTorch (Paszke et al., 2017), with a flexible configuration of the machine topology supported by Kubernetes. The cluster consists of Intel Xeon E5-2680 v3 servers and each server has 2 NVIDIA TITAN Xp GPUs. We use the notion a×b-GPU to denote the topology of the cluster, i.e., a nodes, each with b GPUs. Our code is available at https://github.com/epfml/LocalSGD-Code.
Open Datasets | Yes | Datasets. We evaluate all methods on the following two main (standard) tasks: (1) Image classification for CIFAR-10/100 (Krizhevsky & Hinton, 2009), and (2) image classification for ImageNet (Russakovsky et al., 2015). The detailed data augmentation scheme is given in Appendix A. ... Image classification for CIFAR-10/100 (Krizhevsky & Hinton, 2009). Each consists of a training set of 50K and a test set of 10K color images of 32×32 pixels, as well as 10 and 100 target classes respectively. ... Image classification for ImageNet (Russakovsky et al., 2015). The ILSVRC 2012 classification dataset consists of 1.28 million images for training, and 50K for validation, with 1K target classes. We use ImageNet-1k (Deng et al., 2009). (A data-loading sketch follows this table.)
Dataset Splits | Yes | Image classification for ImageNet (Russakovsky et al., 2015). The ILSVRC 2012 classification dataset consists of 1.28 million images for training, and 50K for validation, with 1K target classes. We use ImageNet-1k (Deng et al., 2009). ... The data is disjointly partitioned and reshuffled globally every epoch. ... The training procedure is terminated when the distributed algorithms have accessed the same number of samples as a standalone worker would access. For example, ResNet-20, DenseNet-40-12 and WideResNet-28-10 would access 300, 300 and 250 epochs respectively. The data is partitioned among the GPUs and reshuffled globally every epoch. The local mini-batches are then sampled from the local data available on each GPU, and their size is fixed to Bloc = 128. ... ResNet-50 training is limited to 90 passes over the data in total, and the data is disjointly partitioned and is re-shuffled globally every epoch. (A sketch of such a partitioning with a PyTorch DistributedSampler follows this table.)
Hardware Specification | Yes | Our algorithms are implemented in PyTorch (Paszke et al., 2017), with a flexible configuration of the machine topology supported by Kubernetes. The cluster consists of Intel Xeon E5-2680 v3 servers and each server has 2 NVIDIA TITAN Xp GPUs. We use the notion a×b-GPU to denote the topology of the cluster, i.e., a nodes, each with b GPUs. We use an 8×2-GPU cluster with a 10 Gbps network. ... TITAN Xp; Tesla V100
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2017)" but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Setup. We briefly outline the general experimental setup, and refer to Appendix A for full details. ... Specific learning schemes for large-batch SGD. We rely on the recently proposed schemes for efficient large-batch training (Goyal et al., 2017), which are formalized by (i) linearly scaling the learning rate w.r.t. the global mini-batch size; (ii) gradual warm-up of the learning rate from a small value. See Appendix A.3 for more details. ... Distributed training procedure on CIFAR-10/100. The experiments follow the common mini-batch SGD training scheme for CIFAR (He et al., 2016a;b; Huang et al., 2017b), and all competing methods access the same total number of data samples (i.e., gradients) regardless of the number of local steps. ... The learning rate scheme follows (He et al., 2016a; Huang et al., 2017b), where we drop the initial learning rate by a factor of 10 when the model has accessed 50% and 75% of the total number of training samples. Unless mentioned specifically, the learning rate used is scaled by the global mini-batch size (B·K for mini-batch SGD and Bloc·K for local SGD), where the initial learning rate is fine-tuned for each model and each task on a single worker. See Appendix A.4 for more details. ... We use ResNet-20 (He et al., 2016a) with CIFAR-10 as a base configuration ... Bloc = 128. ... The initial learning rates of ResNet-20, DenseNet-40-12 and WideResNet-28-10 are fine-tuned on a single GPU (0.2, 0.2 and 0.1 respectively) ... we use a Nesterov momentum of 0.9 without dampening ... The weight decays of ResNet-20, DenseNet-40-12 and WideResNet-28-10 are 1e-4, 1e-4 and 5e-4 respectively. (An optimizer and learning-rate-schedule sketch is given below.)
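
To make the comparison quoted in the Research Type row concrete, the following is a minimal single-process PyTorch sketch of one local SGD communication round with K workers, H local steps, and local mini-batch size Bloc. The function and variable names (local_sgd_round, worker_loaders, ...) are illustrative assumptions, not taken from the authors' repository; setting H = 1 recovers synchronous mini-batch SGD with effective batch size K·Bloc.

import copy
import torch

def local_sgd_round(global_model, worker_loaders, loss_fn, lr, H):
    """One communication round: every worker takes H local SGD steps
    starting from the current global model, then the models are averaged."""
    worker_models = [copy.deepcopy(global_model) for _ in worker_loaders]
    for model, loader in zip(worker_models, worker_loaders):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        batches = iter(loader)                    # assumes >= H batches are available
        for _ in range(H):                        # H local steps between synchronizations
            x, y = next(batches)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():                         # the communication step: average parameters
        for p_global, *p_workers in zip(global_model.parameters(),
                                        *[m.parameters() for m in worker_models]):
            p_global.copy_(torch.stack([p.data for p in p_workers]).mean(dim=0))
    return global_model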
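
The Pseudocode row lists Algorithm 2 (post-local SGD). A hedged sketch of that schedule, reusing local_sgd_round from the snippet above, is shown next: the first phase runs synchronous mini-batch SGD (H = 1) and only afterwards switches to local SGD with H > 1 local steps. The switch point t_switch is an illustrative parameter; in the paper it is tied to the first learning-rate decay.

def post_local_sgd(global_model, worker_loaders, loss_fn, lr,
                   num_rounds, t_switch, H_local):
    for t in range(num_rounds):
        H = 1 if t < t_switch else H_local   # H = 1 is synchronous mini-batch SGD
        global_model = local_sgd_round(global_model, worker_loaders,
                                       loss_fn, lr, H)
    return global_model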
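
For the Open Datasets row, a torchvision-based loading sketch for CIFAR-10 follows. The augmentation (pad-4 random crop plus horizontal flip) is the common CIFAR scheme from the ResNet literature and is an assumption here; the paper's exact scheme is described in its Appendix A.

import torchvision
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),      # pad-and-crop for 32x32 CIFAR images
    T.RandomHorizontalFlip(),
    T.ToTensor(),                     # normalization omitted for brevity
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
test_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=T.ToTensor())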
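
The Dataset Splits row describes a disjoint partition of the data across GPUs that is reshuffled globally every epoch, with a fixed local mini-batch size Bloc = 128. One standard way to realize this in PyTorch is a DistributedSampler whose epoch counter is advanced every pass; the sketch below (reusing train_set from the previous snippet) illustrates the idea and is not necessarily how the authors' repository implements it.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

world_size, rank = 16, 0        # e.g. the 8x2-GPU cluster, viewed from rank 0
B_loc = 128                     # fixed local mini-batch size quoted above

sampler = DistributedSampler(train_set, num_replicas=world_size,
                             rank=rank, shuffle=True)
train_loader = DataLoader(train_set, batch_size=B_loc, sampler=sampler)

num_epochs = 300                # e.g. the ResNet-20 sample budget quoted above
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)    # global reshuffle; partitions stay disjoint
    for x, y in train_loader:
        pass                    # local SGD steps on this worker's shard go here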
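
Finally, the Experiment Setup row combines linear learning-rate scaling, gradual warm-up, decay by a factor of 10 at 50% and 75% of training, Nesterov momentum 0.9, and per-model weight decay. The snippet below sketches that configuration for ResNet-20 on CIFAR-10 with K = 16 workers; the 5-epoch warm-up length (borrowed from Goyal et al., 2017) and the stand-in torchvision model are assumptions, since the exact values live in the paper's appendix.

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)   # stand-in for the paper's ResNet-20

base_lr, B_loc, K = 0.2, 128, 16           # single-GPU LR, local batch size, number of workers
total_epochs, warmup_epochs = 300, 5       # warm-up length is an assumption
scaled_lr = base_lr * (B_loc * K) / B_loc  # linear scaling w.r.t. the global mini-batch size

optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr,
                            momentum=0.9, nesterov=True,   # no dampening (SGD default)
                            weight_decay=1e-4)             # 1e-4 for ResNet-20

def lr_factor(epoch):
    if epoch < warmup_epochs:              # gradual warm-up from a small value
        return (epoch + 1) / warmup_epochs
    if epoch < 0.50 * total_epochs:
        return 1.0
    if epoch < 0.75 * total_epochs:
        return 0.1                         # first 10x drop at 50% of the samples
    return 0.01                            # second 10x drop at 75%

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# scheduler.step() is then called once at the end of every epoch.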