Is Local SGD Better than Minibatch SGD?

Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan McMahan, Ohad Shamir, Nathan Srebro

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show that the answer to this question is sometimes. There is a regime in which local SGD indeed matches or improves upon minibatch SGD, but perhaps surprisingly, there is also a regime in which local SGD really is strictly worse than minibatch SGD. [...] We demonstrate this behaviour empirically, using a logistic regression problem where local SGD indeed behaves much worse than mini-batch SGD in the theoretically-predicted problematic regime. [...] Figure 1. We constructed a dataset of 50000 points in R^25 with the ith coordinate of each point distributed independently according to a Gaussian distribution N(0, 10i^-2).
Researcher Affiliation | Collaboration | 1. Toyota Technological Institute at Chicago; 2. EPFL; 3. University of Chicago; 4. Google; 5. Weizmann Institute of Science. Correspondence to: Blake Woodworth <blake@ttic.edu>.
Pseudocode | No | The paper describes the Local SGD process mathematically with equation (2), and refers to 'linear update algorithms', but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
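Since the paper gives the update only as an equation, the following is an illustrative sketch (not the authors' pseudocode) of one communication round of local SGD, compared against minibatch SGD with the same budget of MK gradient evaluations per round. The logistic-gradient helper and sampling scheme are our own assumptions for a self-contained example:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(w, x, y):
    """Logistic-loss gradient on one sample for a linear model (no bias)."""
    p = 1.0 / (1.0 + np.exp(-x @ w))
    return (p - y) * x

def local_sgd_round(w, data, M, K, lr):
    """One round: M machines each take K local SGD steps from the shared
    iterate w, then the local iterates are averaged (the communication)."""
    X, y = data
    local_iterates = []
    for _ in range(M):
        w_m = w.copy()
        for _ in range(K):
            i = rng.integers(len(y))           # draw one sample
            w_m -= lr * stochastic_grad(w_m, X[i], y[i])
        local_iterates.append(w_m)
    return np.mean(local_iterates, axis=0)

def minibatch_sgd_round(w, data, M, K, lr):
    """One round of minibatch SGD with the same budget: K steps, each
    averaging M stochastic gradients at the current shared iterate."""
    X, y = data
    for _ in range(K):
        idx = rng.integers(len(y), size=M)
        g = np.mean([stochastic_grad(w, X[i], y[i]) for i in idx], axis=0)
        w = w - lr * g
    return w
```

Note that with K = 1 the two procedures coincide; the paper's question is precisely how they compare for K > 1.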
Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | No | The paper describes constructing a dataset ('We constructed a dataset of 50000 points in R^25'), but it does not provide concrete access information (e.g., a link, DOI, or specific citation to a public dataset repository) for this dataset.
Dataset Splits | No | The paper describes using a dataset for training ('We used each algorithm to train a linear model'), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only presents the empirical results without specifying the machine on which they were obtained.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | Figure 1. We constructed a dataset of 50000 points in R^25 with the ith coordinate of each point distributed independently according to a Gaussian distribution N(0, 10i^-2). The labels are generated via P[y = 1 | x] = σ(min{⟨w1, x⟩ + b1, ⟨w2, x⟩ + b2}) where w1, w2 ∼ N(0, I_{25×25}) and b1, b2 ∼ N(0, 1), and σ(a) = 1/(1+exp(-a)) is the sigmoid function, i.e. the labels correspond to an intersection of two halfspaces with label noise which increases as one approaches the decision boundary. We used each algorithm to train a linear model with a bias term to minimize the logistic loss over the 50000 points, i.e. f is the logistic loss on one sample and D is the empirical distribution over the 50000 samples. For each M, K, and algorithm, we tuned the constant stepsize to minimize the loss after r rounds of communication, individually for each 1 ≤ r ≤ R.
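The data-generation recipe in the caption can be sketched as follows. This is a reconstruction under stated assumptions: the variance exponent (10i^-2) and the two-halfspace label model are read off the (extraction-damaged) caption, and the random seed and sampling details are ours, so treat it as an illustration rather than the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50_000, 25

# i-th coordinate ~ N(0, 10 * i^-2): variance decays with coordinate index.
i = np.arange(1, d + 1)
X = rng.normal(scale=np.sqrt(10.0 * i.astype(float) ** -2), size=(n, d))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Labels from an intersection of two random halfspaces, with label noise
# that increases as a point approaches the decision boundary.
w1, w2 = rng.normal(size=d), rng.normal(size=d)
b1, b2 = rng.normal(), rng.normal()
p = sigmoid(np.minimum(X @ w1 + b1, X @ w2 + b2))
y = (rng.random(n) < p).astype(int)
```

A linear model with a bias term would then be fit to (X, y) under the logistic loss, with the stepsize tuned per (M, K, algorithm, r) as the caption describes.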