Don't Use Large Mini-batches, Use Local SGD

Authors: Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Key aspects of the empirical performance of local SGD compared to mini-batch baselines are illustrated in Figure 1. In scenario 1), comparing local SGD with H = 4 (A4) against mini-batch SGD of the same effective batch size B = 4·Bloc (A3) reveals a stark difference, both in terms of train and test error (local SGD achieves lower training loss and higher test accuracy). This motivates the use of local SGD as an alternative to large-batch training, a hypothesis that we confirm in our experiments. Further, in scenario 2), mini-batch SGD with the smaller batch size B = Bloc (A2) is observed to suffer from poor generalization, although its training curve matches the single-machine baseline (A1). Our main contributions can thus be summarized as follows: Trade-offs in Local SGD: We provide the first comprehensive empirical study of the trade-offs in local SGD for deep learning when varying the number of workers K, the number of local steps H, and the mini-batch sizes, for both scenarios 1) on communication efficiency and 2) on generalization. Our empirical experiments on standard benchmarks show that post-local SGD can reach flatter minima than large-batch SGD on those problems. (A minimal code sketch of the local SGD update rule is given after this table.)
Researcher Affiliation | Collaboration | Tao Lin, EPFL, Switzerland, tao.lin@epfl.ch; Sebastian U. Stich, EPFL, Switzerland, sebastian.stich@epfl.ch; Kumar Kshitij Patel, IIT Kanpur, India, kumarkshitijpatel@gmail.com; Martin Jaggi, EPFL, Switzerland, martin.jaggi@epfl.ch
Pseudocode | Yes | Algorithm 1 Local SGD ... Algorithm 2 Post-local SGD ... Algorithm 3 (Post-)Local SGD with the compression scheme in signSGD ... Algorithm 4 (Post-)Local SGD with the compression scheme in EF-signSGD ... Algorithm 5 Hierarchical Local SGD. (A sketch of the post-local SGD schedule follows this table.)
Open Source Code | Yes | Our algorithms are implemented in PyTorch (Paszke et al., 2017), with a flexible configuration of the machine topology supported by Kubernetes. The cluster consists of Intel Xeon E5-2680 v3 servers and each server has 2 NVIDIA TITAN Xp GPUs. We use the notion a×b-GPU to denote the topology of the cluster, i.e., a nodes, each with b GPUs. Our code is available at https://github.com/epfml/LocalSGD-Code.
Open Datasets | Yes | Datasets. We evaluate all methods on the following two main (standard) tasks: (1) Image classification for CIFAR-10/100 (Krizhevsky & Hinton, 2009), and (2) image classification for ImageNet (Russakovsky et al., 2015). The detailed data augmentation scheme is given in Appendix A. ... Image classification for CIFAR-10/100 (Krizhevsky & Hinton, 2009). Each consists of a training set of 50K and a test set of 10K color images of 32×32 pixels, as well as 10 and 100 target classes respectively. ... Image classification for ImageNet (Russakovsky et al., 2015). The ILSVRC 2012 classification dataset consists of 1.28 million images for training, and 50K for validation, with 1K target classes. We use ImageNet-1k (Deng et al., 2009). (A data-loading sketch follows this table.)
Dataset Splits | Yes | Image classification for ImageNet (Russakovsky et al., 2015). The ILSVRC 2012 classification dataset consists of 1.28 million images for training, and 50K for validation, with 1K target classes. We use ImageNet-1k (Deng et al., 2009). ... The data is disjointly partitioned and reshuffled globally every epoch. ... The training procedure is terminated when the distributed algorithms have accessed the same number of samples as a standalone worker would access. For example, ResNet-20, DenseNet-40-12 and WideResNet-28-10 would access 300, 300 and 250 epochs respectively. The data is partitioned among the GPUs and reshuffled globally every epoch. The local mini-batches are then sampled from the local data available on each GPU, and their size is fixed to Bloc = 128. ... ResNet-50 training is limited to 90 passes over the data in total, and the data is disjointly partitioned and is re-shuffled globally every epoch. (A sketch of such a partitioning with a PyTorch DistributedSampler follows this table.)
Hardware Specification | Yes | Our algorithms are implemented in PyTorch (Paszke et al., 2017), with a flexible configuration of the machine topology supported by Kubernetes. The cluster consists of Intel Xeon E5-2680 v3 servers and each server has 2 NVIDIA TITAN Xp GPUs. We use the notion a×b-GPU to denote the topology of the cluster, i.e., a nodes, each with b GPUs. We use an 8×2-GPU cluster with a 10 Gbps network. ... TITAN Xp; Tesla V100
Software Dependencies | No | The paper mentions "PyTorch (Paszke et al., 2017)" but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | Setup. We briefly outline the general experimental setup, and refer to Appendix A for full details. ... Specific learning schemes for large-batch SGD. We rely on the recently proposed schemes for efficient large-batch training (Goyal et al., 2017), which are formalized by (i) linearly scaling the learning rate w.r.t. the global mini-batch size; (ii) gradual warm-up of the learning rate from a small value. See Appendix A.3 for more details. ... Distributed training procedure on CIFAR-10/100. The experiments follow the common mini-batch SGD training scheme for CIFAR (He et al., 2016a;b; Huang et al., 2017b), and all competing methods access the same total number of data samples (i.e., gradients) regardless of the number of local steps. ... The learning rate scheme follows (He et al., 2016a; Huang et al., 2017b), where we drop the initial learning rate by a factor of 10 when the model has accessed 50% and 75% of the total number of training samples. Unless mentioned specifically, the learning rate used is scaled by the global mini-batch size (B·K for mini-batch SGD and Bloc·K for local SGD), where the initial learning rate is fine-tuned for each model and each task on a single worker. See Appendix A.4 for more details. ... We use ResNet-20 (He et al., 2016a) with CIFAR-10 as a base configuration ... Bloc = 128. ... The initial learning rates of ResNet-20, DenseNet-40-12 and WideResNet-28-10 are fine-tuned on a single GPU (0.2, 0.2 and 0.1 respectively) ... we use a Nesterov momentum of 0.9 without dampening ... The weight decays of ResNet-20, DenseNet-40-12 and WideResNet-28-10 are 1e-4, 1e-4 and 5e-4 respectively. (An optimizer and learning-rate-schedule sketch is given below.)
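
To make the comparison quoted in the Research Type row concrete, the following is a minimal single-process PyTorch sketch of one local SGD communication round with K workers, H local steps, and local mini-batch size Bloc. The function and variable names (local_sgd_round, worker_loaders, ...) are illustrative assumptions, not taken from the authors' repository; setting H = 1 recovers synchronous mini-batch SGD with effective batch size K·Bloc.

import copy
import torch

def local_sgd_round(global_model, worker_loaders, loss_fn, lr, H):
    """One communication round: every worker takes H local SGD steps
    starting from the current global model, then the models are averaged."""
    worker_models = [copy.deepcopy(global_model) for _ in worker_loaders]
    for model, loader in zip(worker_models, worker_loaders):
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        batches = iter(loader)                    # assumes >= H batches are available
        for _ in range(H):                        # H local steps between synchronizations
            x, y = next(batches)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    with torch.no_grad():                         # the communication step: average parameters
        for p_global, *p_workers in zip(global_model.parameters(),
                                        *[m.parameters() for m in worker_models]):
            p_global.copy_(torch.stack([p.data for p in p_workers]).mean(dim=0))
    return global_model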
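
The Pseudocode row lists Algorithm 2 (post-local SGD). A hedged sketch of that schedule, reusing local_sgd_round from the snippet above, is shown next: the first phase runs synchronous mini-batch SGD (H = 1) and only afterwards switches to local SGD with H > 1 local steps. The switch point t_switch is an illustrative parameter; in the paper it is tied to the first learning-rate decay.

def post_local_sgd(global_model, worker_loaders, loss_fn, lr,
                   num_rounds, t_switch, H_local):
    for t in range(num_rounds):
        H = 1 if t < t_switch else H_local   # H = 1 is synchronous mini-batch SGD
        global_model = local_sgd_round(global_model, worker_loaders,
                                       loss_fn, lr, H)
    return global_model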
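
For the Open Datasets row, a torchvision-based loading sketch for CIFAR-10 follows. The augmentation (pad-4 random crop plus horizontal flip) is the common CIFAR scheme from the ResNet literature and is an assumption here; the paper's exact scheme is described in its Appendix A.

import torchvision
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomCrop(32, padding=4),      # pad-and-crop for 32x32 CIFAR images
    T.RandomHorizontalFlip(),
    T.ToTensor(),                     # normalization omitted for brevity
])
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=train_transform)
test_set = torchvision.datasets.CIFAR10(
    root="./data", train=False, download=True, transform=T.ToTensor())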
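
The Dataset Splits row describes a disjoint partition of the data across GPUs that is reshuffled globally every epoch, with a fixed local mini-batch size Bloc = 128. One standard way to realize this in PyTorch is a DistributedSampler whose epoch counter is advanced every pass; the sketch below (reusing train_set from the previous snippet) illustrates the idea and is not necessarily how the authors' repository implements it.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

world_size, rank = 16, 0        # e.g. the 8x2-GPU cluster, viewed from rank 0
B_loc = 128                     # fixed local mini-batch size quoted above

sampler = DistributedSampler(train_set, num_replicas=world_size,
                             rank=rank, shuffle=True)
train_loader = DataLoader(train_set, batch_size=B_loc, sampler=sampler)

num_epochs = 300                # e.g. the ResNet-20 sample budget quoted above
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)    # global reshuffle; partitions stay disjoint
    for x, y in train_loader:
        pass                    # local SGD steps on this worker's shard go here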
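
Finally, the Experiment Setup row combines linear learning-rate scaling, gradual warm-up, decay by a factor of 10 at 50% and 75% of training, Nesterov momentum 0.9, and per-model weight decay. The snippet below sketches that configuration for ResNet-20 on CIFAR-10 with K = 16 workers; the 5-epoch warm-up length (borrowed from Goyal et al., 2017) and the stand-in torchvision model are assumptions, since the exact values live in the paper's appendix.

import torch
import torchvision

model = torchvision.models.resnet18(num_classes=10)   # stand-in for the paper's ResNet-20

base_lr, B_loc, K = 0.2, 128, 16           # single-GPU LR, local batch size, number of workers
total_epochs, warmup_epochs = 300, 5       # warm-up length is an assumption
scaled_lr = base_lr * (B_loc * K) / B_loc  # linear scaling w.r.t. the global mini-batch size

optimizer = torch.optim.SGD(model.parameters(), lr=scaled_lr,
                            momentum=0.9, nesterov=True,   # no dampening (SGD default)
                            weight_decay=1e-4)             # 1e-4 for ResNet-20

def lr_factor(epoch):
    if epoch < warmup_epochs:              # gradual warm-up from a small value
        return (epoch + 1) / warmup_epochs
    if epoch < 0.50 * total_epochs:
        return 1.0
    if epoch < 0.75 * total_epochs:
        return 0.1                         # first 10x drop at 50% of the samples
    return 0.01                            # second 10x drop at 75%

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
# scheduler.step() is then called once at the end of every epoch.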