Is Local SGD Better than Minibatch SGD?
Authors: Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan McMahan, Ohad Shamir, Nathan Srebro
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we show that the answer to this question is sometimes. There is a regime in which local SGD indeed matches or improves upon minibatch SGD, but perhaps surprisingly, there is also a regime in which local SGD really is strictly worse than minibatch SGD. [...] We demonstrate this behaviour empirically, using a logistic regression problem where local SGD indeed behaves much worse than mini-batch SGD in the theoretically-predicted problematic regime. [...] Figure 1. We constructed a dataset of 50000 points in R^25 with the ith coordinate of each point distributed independently according to a Gaussian distribution N(0, 10i^2). |
| Researcher Affiliation | Collaboration | 1 Toyota Technological Institute at Chicago, 2 EPFL, 3 University of Chicago, 4 Google, 5 Weizmann Institute of Science. Correspondence to: Blake Woodworth <blake@ttic.edu>. |
| Pseudocode | No | The paper describes the local SGD process mathematically in equation (2) and refers to 'linear update algorithms', but it does not include a clearly labeled 'Pseudocode' or 'Algorithm' block. (A hedged sketch of the local SGD and minibatch SGD updates is given after the table.) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | No | The paper describes constructing a dataset ('We constructed a dataset of 50000 points in R^25'), but it does not provide concrete access information (e.g., a link, DOI, or specific citation to a public dataset repository) for this dataset. |
| Dataset Splits | No | The paper describes using a dataset for training ('We used each algorithm to train a linear model'), but it does not specify explicit training, validation, or test dataset splits (e.g., percentages or sample counts). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It only presents the empirical results without specifying the machine on which they were obtained. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Figure 1. We constructed a dataset of 50000 points in R^25 with the ith coordinate of each point distributed independently according to a Gaussian distribution N(0, 10i^2). The labels are generated via P[y = 1 | x] = σ(min{⟨w₁, x⟩ + b₁, ⟨w₂, x⟩ + b₂}) where w₁, w₂ ∼ N(0, I_{25×25}) and b₁, b₂ ∼ N(0, 1), and σ(a) = 1/(1 + exp(−a)) is the sigmoid function, i.e. the labels correspond to an intersection of two halfspaces with label noise which increases as one approaches the decision boundary. We used each algorithm to train a linear model with a bias term to minimize the logistic loss over the 50000 points, i.e. f is the logistic loss on one sample and D is the empirical distribution over the 50000 samples. For each M, K, and algorithm, we tuned the constant stepsize to minimize the loss after r rounds of communication, individually for each 1 ≤ r ≤ R. (A data-generation sketch based on this description appears below the table.) |
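
Since the paper presents local SGD only through its update equation rather than a labeled algorithm block, the following is a minimal sketch of one communication round of local SGD versus minibatch SGD with M machines and K local steps per round. The function names and the `stochastic_grad` oracle are illustrative assumptions, not the authors' code.

```python
import numpy as np

def local_sgd_round(w, stochastic_grad, M, K, lr, rng):
    """One communication round of local SGD (sketch).

    Each of the M machines starts from the shared iterate `w`, takes K
    sequential SGD steps using its own stochastic gradients, and the
    round ends by averaging the M local iterates.
    """
    local_iterates = []
    for m in range(M):
        w_m = w.copy()
        for _ in range(K):
            w_m = w_m - lr * stochastic_grad(w_m, rng)  # local step
        local_iterates.append(w_m)
    return np.mean(local_iterates, axis=0)  # communicate and average

def minibatch_sgd_round(w, stochastic_grad, M, K, lr, rng):
    """One communication round of minibatch SGD (sketch).

    All M*K stochastic gradients of the round are evaluated at the same
    point `w`, i.e. the round is a single step with a minibatch of size M*K.
    """
    grads = [stochastic_grad(w, rng) for _ in range(M * K)]
    return w - lr * np.mean(grads, axis=0)
```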
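
The experiment-setup description in the last row can be turned into a short data-generation sketch. The per-coordinate variance 10i^2 and the two-halfspace label model follow the reconstructed Figure 1 caption quoted above; the seed, variable names, and helper functions are assumptions added for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_dataset(n=50_000, d=25, seed=0):
    """Sketch of the synthetic logistic-regression dataset from Figure 1.

    Coordinate i of each point is drawn independently from N(0, 10*i**2)
    (as reconstructed from the garbled caption), and labels come from a
    sigmoid applied to the minimum of two random affine functions, i.e.
    an intersection of two halfspaces with label noise near the boundary.
    """
    rng = np.random.default_rng(seed)  # assumed seed, not from the paper
    i = np.arange(1, d + 1)
    X = rng.normal(0.0, np.sqrt(10.0 * i**2), size=(n, d))  # per-coordinate std
    W = rng.normal(0.0, 1.0, size=(2, d))   # two random halfspace normals
    b = rng.normal(0.0, 1.0, size=2)        # two random offsets
    margins = np.min(X @ W.T + b, axis=1)   # signed distance to the nearer boundary
    y = (rng.random(n) < sigmoid(margins)).astype(int)  # noisy labels
    return X, y
```

The tuning protocol described in the Experiment Setup row would then sit on top of this: for each M, K, and algorithm, sweep a constant stepsize and keep the one minimizing the logistic loss after r communication rounds, separately for each 1 ≤ r ≤ R.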