Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Understanding Outer Optimizers in Local SGD: Learning Rates, Momentum, and Acceleration
Authors: Ahmed Khaled, Satyen Kale, Arthur Douillard, Chi Jin, Rob Fergus, Manzil Zaheer
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study the role of the outer optimizer in Local SGD, and prove new convergence guarantees for the algorithm. In particular, we show that tuning the outer learning rate allows us to (a) trade off between optimization error and stochastic gradient noise variance, and (b) make up for ill-tuning of the inner learning rate. Our theory suggests that the outer learning rate should sometimes be set to values greater than 1. We extend our results to settings where we use momentum in the outer optimizer, and we show a similar role for the momentum-adjusted outer learning rate. We also study acceleration in the outer optimizer and show that it improves the convergence rate as a function of the number of communication rounds, improving upon the convergence rate of prior algorithms that apply acceleration locally. Finally, we also introduce a novel data-dependent analysis of Local SGD that yields further insights on outer learning rate tuning. We conduct comprehensive experiments with standard language models and various outer optimizers to validate our theory. |
| Researcher Affiliation | Collaboration | Ahmed Khaled, Princeton University, EMAIL; Satyen Kale, Google Research, EMAIL; Arthur Douillard, Google DeepMind, EMAIL; Chi Jin, Princeton University, Princeton, NJ 08544, EMAIL; Rob Fergus, NYU, Meta, EMAIL; Manzil Zaheer, Google DeepMind, EMAIL |
| Pseudocode | Yes | Algorithm 1: The FedOpt Algorithmic Template. 1: Input: update rules LocalUpdate and OuterUpdate; initial point $x_0$. 2: for communication rounds $r = 0, 1, \ldots, R-1$ do 3: Broadcast $x_r$ to each node $m$. 4: for each node $m$ in parallel do 5: Set $y_{m,r,0} = x_r$. 6: for local steps $h = 0, 1, \ldots, H-1$ do 7: Set $y_{m,r,h+1} = \text{LocalUpdate}(y_{m,r,h}, g_{m,r,h})$ for stochastic gradient $g_{m,r,h}$ at $y_{m,r,h}$. 8: end for 9: Communicate $y_{m,r,H}$ to the server. 10: end for 11: Compute the update or outer gradient $\hat{\Delta}_{r,H} = \frac{1}{M} \sum_{m=1}^{M} (y_{m,r,H} - x_r)$. 12: Update $x_{r+1} = \text{OuterUpdate}(x_r, \hat{\Delta}_{r,H})$. 13: end for |
| Open Source Code | No | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: The datasets are openly available, and some of the training code will be shared. However, much of the training code is proprietary and won't be shared. |
| Open Datasets | Yes | We conduct two sets of experiments: (a) solving convex optimization problems to provide the most direct verification of the predictions of our theory, and (b) training transformer-based language models. Following the DiLoCo paper (Douillard, Feng, Rusu, Chhaparia, et al., 2023), we experiment using a Chinchilla decoder transformer (Hoffmann et al., 2022) on the C4 dataset (Raffel et al., 2020). |
| Dataset Splits | Yes | The perplexity was calculated on the C4 validation set. Consistent with the predictions of our theory, we found that an outer learning rate greater than 1.0 performed best for SF-SGD and a relatively large effective outer learning rate also performed best for Nesterov; moreover, acceleration consistently improved performance relative to the baseline Local SGD. In the supplementary material, we report the effect of varying the number of local steps (Section A.1.2), the number of clients/replicas and different ways of FLOPs allocation (Section A.1.3), and gradient variance (Section A.1.6). We also include the validation results for all the main experiments we ran in Tables 3 to 5. |
| Hardware Specification | No | Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments? Answer: [Yes] Justification: We provide the details of the FLOP budget in the supplementary. |
| Software Dependencies | No | For all experiments, the inner optimizer is AdamW (Loshchilov and Hutter, 2019) trained with a cosine learning rate schedule defined across the total amount of steps. The inner optimizer state is never shared across replicas, and is passed from one round to the other. |
| Experiment Setup | Yes | We conduct experiments on the quadratic objective $f(x) = \frac{1}{2} \lVert Q(x - x_*) \rVert^2$, where $Q = A^\top A \in \mathbb{R}^{d \times d}$ for $d = 50$ and the entries $A_{i,j}$ are all drawn from a normal distribution $A_{i,j} \sim \mathcal{N}(0, 1)$ for $i = 1, \ldots, d$ and $j = 1, \ldots, d$, and $x_*$ is similarly drawn from the standard $d$-dimensional Gaussian. We use stochastic gradients of the form $g(x) = \nabla f(x) + v$, where the $v$'s are random vectors drawn from the Gaussian with mean 0 and variance $\sigma^2$, $v \sim \mathcal{N}(0, \sigma^2)$. We evaluate the performance of Algorithm 1 for various values of $\sigma$, $\sigma \in \{10^{-3}, 10^{-2}, 10^{-1}, 0.5, 1, 5, 10, 15, 25, 50\}$. For each $\sigma$ we perform an extensive grid search over $\gamma \in \{0.001, 0.01, 0.1, 0.5, 0.9, 1.0, 1.1, 1.25, 1.5, 2\}$ to determine the best one in terms of minimum average loss over the last ten rounds. We use $R = 1000$ rounds and $H = 50$ local steps, and fix $\eta = 0.001$ in all cases. Table 1: Optimizer hyperparameters for the three evaluated sizes. All are based on the transformer architecture, chinchilla-style (Hoffmann et al., 2022). Each entry lists the hyperparameter, then the selected value, then the range considered. Number of inner steps H: 50, 500; 50 to 2000. Peak outer LR for Nesterov: 0.7; 0.1 to 2.0. Peak outer LR for SF-SGD: 2.0; 1e-4 to 10.0. b1 for SF-SGD: 0.2; 0.0 to 0.99. Peak inner learning rate (150M): 4e-4; 4e-4. Peak inner learning rate (400M): 4e-4; 4e-4. Peak inner learning rate (1B): 2e-4; 2e-4. |
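
The Pseudocode and Experiment Setup rows together describe a template that can be sketched in a few lines of NumPy. The sketch below is illustrative only and is not the authors' code. It assumes plain SGD for both LocalUpdate and OuterUpdate, reads η as the inner learning rate and γ as the outer learning rate, and uses hypothetical names (`make_quadratic`, `local_sgd`). It also picks the inner step size relative to the smoothness constant of the objective rather than the quoted η = 0.001, which is a choice made here only to keep the toy run stable.

```python
import numpy as np


def make_quadratic(d, rng):
    """Build f(x) = 0.5 * ||Q (x - x_star)||^2 with Q = A^T A, A_ij ~ N(0, 1)."""
    A = rng.standard_normal((d, d))
    Q = A.T @ A
    x_star = rng.standard_normal(d)
    hessian = Q.T @ Q  # Hessian of f; symmetric positive semidefinite

    def loss(x):
        r = Q @ (x - x_star)
        return 0.5 * float(r @ r)

    def grad(x):
        # Works for a single point (d,) or a batch of points (M, d),
        # because the Hessian is symmetric.
        return (x - x_star) @ hessian

    smoothness = np.linalg.norm(Q, 2) ** 2  # largest eigenvalue of the Hessian
    return loss, grad, smoothness


def local_sgd(grad, x0, *, num_rounds, num_local_steps, num_nodes,
              inner_lr, outer_lr, sigma, rng):
    """Algorithm 1 with SGD as LocalUpdate and x + outer_lr * delta as OuterUpdate.

    Both update rules are assumptions of this sketch.
    """
    x = np.array(x0, dtype=float)
    for _ in range(num_rounds):
        # Broadcast x_r: every node starts from the same point.
        y = np.tile(x, (num_nodes, 1))
        for _ in range(num_local_steps):
            # Stochastic gradient g = grad f(y) + v with Gaussian noise v.
            noise = sigma * rng.standard_normal(y.shape)
            y = y - inner_lr * (grad(y) + noise)
        # Outer gradient: average displacement of the nodes from x_r.
        delta = (y - x).mean(axis=0)
        x = x + outer_lr * delta
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    loss, grad, smoothness = make_quadratic(10, rng)
    x0 = np.zeros(10)
    for gamma in (0.5, 1.0, 1.5):
        x = local_sgd(grad, x0, num_rounds=50, num_local_steps=10,
                      num_nodes=4, inner_lr=0.1 / smoothness,
                      outer_lr=gamma, sigma=0.1, rng=rng)
        print(f"outer_lr={gamma}: loss={loss(x):.4g}")
```

With `sigma=0.0` and `outer_lr=1.0` every node follows the same trajectory, so the sketch reduces to plain gradient descent with `num_rounds * num_local_steps` steps, which is a convenient sanity check before trying larger outer learning rates.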