Why (and When) does Local SGD Generalize Better than SGD?

Authors: Xinran Gu, Kaifeng Lyu, Longbo Huang, Sanjeev Arora

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The main contributions of this paper include (i) the derivation of an SDE that captures the long-term behavior of Local SGD in the small-learning-rate regime, showing how noise drives the iterate to drift and diffuse once it is close to the manifold of local minima; (ii) a comparison between the SDEs of Local SGD and SGD, showing that Local SGD induces a stronger drift term and hence a stronger regularization effect, e.g., a faster reduction of sharpness; and (iii) empirical evidence that a small learning rate and a long enough training time together enable the generalization improvement over SGD, while removing either condition eliminates the improvement. (A schematic of the slow SDE is sketched after the table.)
Researcher Affiliation | Academia | Xinran Gu (Institute for Interdisciplinary Information Sciences, Tsinghua University; gxr21@mails.tsinghua.edu.cn); Kaifeng Lyu (Department of Computer Science, Princeton University; klyu@cs.princeton.edu); Longbo Huang (Institute for Interdisciplinary Information Sciences, Tsinghua University; longbohuang@tsinghua.edu.cn); Sanjeev Arora (Department of Computer Science, Princeton University; arora@cs.princeton.edu)
Pseudocode | Yes | Algorithms 1 and 2 give distributed samplers for drawing local batches with and without replacement; Algorithms 3 to 5 then give parallel SGD, Local SGD, and Post-local SGD, each of which can run with either sampler. (A minimal illustrative sketch of Local SGD follows the table.)
Open Source Code | Yes | Our code is available at https://github.com/hmgxr128/Local-SGD.
Open Datasets | Yes | Our experiments are conducted on CIFAR-10 (Krizhevsky et al., 2009) and ImageNet (Russakovsky et al., 2015).
Dataset Splits | No | The paper uses standard datasets (CIFAR-10, ImageNet) that come with predefined train/test splits, but it does not state exact training/validation/test percentages or sample counts for reproduction. It reports 'test accuracy' and 'validation accuracy' without describing how a validation set was created or used beyond standard practice. (The standard splits are sketched after the table.)
Hardware Specification | Yes | We run all CIFAR-10 experiments with B_loc = 128 on 8 NVIDIA Tesla P100 GPUs, while ImageNet experiments are run on 8 NVIDIA A5000 GPUs with B_loc = 32.
Software Dependencies | Yes | Our implementation of ResNet-56 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2015) is based on the highly starred repository by Wei Yang, and we use the ResNet-50 implementation from torchvision 0.3.1.
Experiment Setup | Yes | We generally adopt the following training strategies. We do not add any momentum unless otherwise stated. Following Jia et al. (2018), we do not apply weight decay to the biases or to the learnable parameters of the normalization layers. For all models with BatchNorm layers, we pass 100 batches of data with batch size B_loc through the network to estimate the running mean and variance before evaluation. Experiments on both datasets follow the standard data augmentation pipeline of He et al. (2016), except for the label-noise experiments. (Both details are sketched after the table.)
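
As a schematic of the slow SDE summarized under Research Type: once the iterate is close to the manifold Γ of local minimizers, its effective position ζ(t) on Γ evolves by a combination of diffusion and drift. The notation below is illustrative only; the paper's theorems give the exact drift and diffusion coefficients, which depend on the Hessian and the gradient-noise covariance.

```latex
% Schematic only (assumed notation): P_\zeta projects onto the tangent space
% of the manifold \Gamma of minimizers, \Sigma is the gradient-noise
% covariance; the paper's theorems give the exact coefficients.
d\zeta(t) \;=\; \underbrace{P_{\zeta}\,\Sigma^{1/2}(\zeta)\,dW_t}_{\text{diffusion along }\Gamma}
\;+\; \underbrace{b(\zeta)\,dt}_{\text{drift}}
```

In this picture, the paper's comparison (ii) amounts to showing that the drift term b(ζ) induced by Local SGD is stronger than the one induced by SGD, which is what accelerates the reduction of sharpness along Γ.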
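
To illustrate the scheme that Algorithms 3 to 5 formalize, here is a minimal PyTorch-style sketch of Local SGD with periodic parameter averaging. The step structure, the synchronization period H, and the use of torch.distributed collectives are assumptions for illustration, not the authors' implementation (see their repository for that).

```python
import torch
import torch.distributed as dist

def local_sgd_step(model, optimizer, loss_fn, batch, step, H):
    """One Local SGD iteration on a single worker: a plain SGD step on
    the local batch, then parameter averaging across all workers every
    H local steps. H is an illustrative hyperparameter."""
    inputs, targets = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Every H steps, replace each worker's parameters with the average
    # over the K workers (all-reduce sum, then divide by world size).
    if (step + 1) % H == 0:
        world_size = dist.get_world_size()
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
            p.data /= world_size
    return loss.item()
```

With H = 1 this reduces to parallel SGD (synchronizing every step); Post-local SGD runs parallel SGD for an initial phase and then switches to H > 1.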
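
For reference on the Dataset Splits entry, the predefined CIFAR-10 splits look as follows in torchvision; the 45,000/5,000 train/validation carve-out shown here is a common convention and purely an assumption, since the paper does not specify one.

```python
import torch
from torchvision import datasets, transforms

# CIFAR-10 ships with a fixed 50,000-image train split and a
# 10,000-image test split; it has no predefined validation set.
train_full = datasets.CIFAR10(root="./data", train=True, download=True,
                              transform=transforms.ToTensor())
test_set = datasets.CIFAR10(root="./data", train=False, download=True,
                            transform=transforms.ToTensor())

# Hypothetical 45k/5k train/validation split (not specified in the paper).
train_set, val_set = torch.utils.data.random_split(
    train_full, [45_000, 5_000],
    generator=torch.Generator().manual_seed(0))
```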
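
The two less common details in the Experiment Setup entry, excluding biases and normalization parameters from weight decay (per Jia et al., 2018) and re-estimating BatchNorm running statistics before evaluation, could be implemented as in the following sketch. The name/shape-based parameter matching and the helper names are assumptions for illustration.

```python
import torch

def build_optimizer(model, lr, weight_decay):
    """SGD without momentum; weight decay is applied only to weight
    tensors, not to biases or normalization-layer parameters.
    Matching by dimensionality/name is a heuristic assumed here."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors cover biases and BatchNorm scales/shifts.
        if p.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return torch.optim.SGD(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, momentum=0.0)

@torch.no_grad()
def recalibrate_bn(model, loader, num_batches=100):
    """Re-estimate BatchNorm running mean/variance by streaming
    num_batches of training data through the model in train mode."""
    for m in model.modules():
        if isinstance(m, torch.nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()
            m.momentum = None  # cumulative average over calibration batches
    model.train()
    for i, (x, _) in enumerate(loader):
        if i >= num_batches:
            break
        model(x)
    model.eval()
```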