Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
On the Choice of Learning Rate for Local SGD
Authors: Lukas Balles, Prabhu Teja S, Cédric Archambeau
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that the optimal learning rate for Local SGD differs substantially from that of SGD, and when using it the performance of Local SGD matches that of SGD. However, this performance comes at the cost of added training iterations, rendering Local SGD faster than SGD only when communication is much more time-consuming than computation. |
| Researcher Affiliation | Industry | Lukas Balles EMAIL Aleph Alpha, Heidelberg, Germany. Work done at AWS. Prabhu Teja S EMAIL Amazon Web Services, Berlin, Germany. Cédric Archambeau EMAIL Helsing, Berlin, Germany. Work done at AWS. |
| Pseudocode | Yes | Algorithm 1: Automatic learning rate scaling for Local SGD (LocalAdaScale). |
| Open Source Code | No | The paper mentions using a third-party library 'FairScale' and refers to GitHub repositories for model code, but it does not contain an explicit statement or a direct link from the authors releasing the source code specific to their methodology described in this paper. |
| Open Datasets | Yes | We train a ResNet-18 (He et al., 2016) on CIFAR-10 (Krizhevsky, 2009), a Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) on ImageNet-32 (Chrabaszcz et al., 2017), and a ResNet-50 on ImageNet (Deng et al., 2009; Russakovsky et al., 2015). |
| Dataset Splits | Yes | The target performance for each experiment is the performance we get when training on one worker (K = 1) with the standard hyperparameter settings (in Appendix H). For CIFAR-10 it is a top-1 accuracy of 93%, for ImageNet32 it is a top-5 accuracy of 69%, and for ImageNet it is a top-5 accuracy of 93%. See Appendix H.1 for attributions. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU models, or memory) used for the authors' experiments are provided in the paper. The paper discusses 'compute infrastructure' and 'hypothetical system' parameters (like relative communication overhead 'm'), but not the actual hardware used for their empirical evaluations. |
| Software Dependencies | No | The paper mentions using 'FairScale' and the 'torchvision' library and refers to GitHub repositories for model code (e.g., 'pytorch-cifar'), but does not specify exact version numbers for these software components. |
| Experiment Setup | Yes | Training hyperparameters are listed in the following table (per dataset: γ_base, momentum, weight decay, LR schedule, epochs): CIFAR-10: 0.1, 0.9, 5×10⁻⁴, cosine decay, 200 epochs; ImageNet32: 0.01, 0.9, 5×10⁻⁴, step (×0.5 every 10 epochs), 40 epochs; ImageNet: 0.1, 0.9, 5×10⁻⁴, step (×0.1 every 30 epochs), 90 epochs. |
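The communication pattern the paper studies can be sketched as follows: each of K workers performs several local SGD steps independently, and workers then synchronize by averaging their parameters. This is a minimal illustrative sketch on a toy quadratic objective, not the authors' LocalAdaScale algorithm; the names `local_sgd` and `noisy_grad` and all hyperparameter values are assumptions for the example.

```python
import numpy as np

def local_sgd(grad_fn, w0, lr, n_workers, local_steps, rounds, rng):
    """Minimal Local SGD sketch: each worker runs `local_steps`
    independent SGD updates, then all workers average parameters.
    Averaging is the only communication step, so communication cost
    scales with `rounds` rather than with total gradient steps."""
    w = np.array(w0, dtype=float)
    for _ in range(rounds):
        replicas = []
        for _ in range(n_workers):
            v = w.copy()
            for _ in range(local_steps):
                v -= lr * grad_fn(v, rng)  # local stochastic gradient step
            replicas.append(v)
        w = np.mean(replicas, axis=0)  # communication: parameter averaging
    return w

# Toy objective f(w) = 0.5 * ||w||^2, gradient w, with additive noise.
def noisy_grad(w, rng):
    return w + 0.1 * rng.standard_normal(w.shape)

rng = np.random.default_rng(0)
w_final = local_sgd(noisy_grad, w0=np.ones(4), lr=0.1,
                    n_workers=4, local_steps=8, rounds=50, rng=rng)
print(np.linalg.norm(w_final))  # small: near the optimum at w = 0
```

The trade-off described in the paper's abstract is visible in the structure above: raising `local_steps` cuts communication rounds but changes the optimization dynamics, which is why the optimal learning rate for Local SGD differs from that of synchronous SGD.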