Towards Optimal Communication Complexity in Distributed Non-Convex Optimization

Authors: Kumar Kshitij Patel, Lingxiao Wang, Blake E. Woodworth, Brian Bullins, Nati Srebro

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose and analyze a new algorithm that improves existing methods by requiring fewer and lighter variance reduction operations. We also present lower bounds, showing our algorithm is either optimal or almost optimal in most settings. Numerical experiments demonstrate the superior performance of our algorithm.
Researcher Affiliation | Academia | Kumar Kshitij Patel (TTIC, kkpatel@ttic.edu); Lingxiao Wang (TTIC, lingxw@ttic.edu); Blake Woodworth (SIERRA, INRIA, blakewoodworth@gmail.com); Brian Bullins (Purdue University, bbullins@purdue.edu); Nati Srebro (TTIC, nati@ttic.edu)
Pseudocode | Yes | Algorithm 1: Communication Efficient Local Stochastic Gradient Descent (CE-LSGD). An illustrative local-update sketch using the reported hyperparameters appears after this table.
Open Source Code | No | The paper states in its review checklist that code is included or available, but the provided text of the paper does not contain a specific link or explicit statement about the release of the code for the methodology described.
Open Datasets | Yes | We evaluate the performance of our method by optimizing a two-layer fully connected network for multi-class classification on the CIFAR-10 [35] data-set. A hedged sketch of such a model appears after this table.
Dataset Splits | No | The paper describes how heterogeneous per-machine data-sets were synthetically generated from CIFAR-10, but it does not specify explicit training, validation, and test splits by percentage or sample count. A sketch of one possible partition scheme appears after this table.
Hardware Specification | No | The paper states in its review checklist that hardware details are provided in the appendix, but the provided text of the paper does not include the appendix or any specific hardware specifications (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper mentions implementing the network but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x).
Experiment Setup | Yes | We use M = 10 machines and synthetically generate heterogeneous data-sets (see Section 4) with q = 0.1. All oracle queries use a mini-batch of size b = 16, i.e., each machine makes Kb oracle queries between two communication rounds. We note that our method converges faster in all the settings, which highlights its communication efficiency. Fixed step-sizes η for both methods were tuned over {0.001, 0.005, 0.01, 0.05, 0.1, 0.5} (to obtain the best loss) following [22]; our method sets the momentum β = 0.3 and b_max^our = K, while b_max^BVR = 5000 according to [22]. A training-loop sketch that wires these hyperparameters together appears after this table.
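
The excerpt describes the experimental model only as "a two-layer fully connected network for multi-class classification" on CIFAR-10. The following is a minimal sketch of such a model; the use of PyTorch, the hidden width of 256, and the ReLU activation are assumptions, since the extracted text names neither a framework nor these architecture details.

import torch.nn as nn

class TwoLayerFC(nn.Module):
    """Two-layer fully connected classifier for CIFAR-10 images.

    The hidden width (256) and the ReLU activation are assumptions; the
    excerpt only states "a two-layer fully connected network".
    """
    def __init__(self, hidden=256, num_classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                    # 3 x 32 x 32 image -> 3072 features
            nn.Linear(3 * 32 * 32, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),  # multi-class logits
        )

    def forward(self, x):
        return self.net(x)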
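
The heterogeneous data-sets are described only through the parameter q = 0.1 and a pointer to Section 4, which is not part of the extracted text. The sketch below shows one common label-skew partition scheme consistent with that description; the function name, the mixing rule, and the interpretation of q are assumptions rather than the authors' generation procedure.

import numpy as np

def heterogeneous_partition(labels, num_machines=10, q=0.1, seed=0):
    """Assign sample indices to machines with label-skewed heterogeneity.

    With probability q a sample is routed to a uniformly random machine;
    otherwise it goes to the machine associated with its label
    (label c -> machine c % num_machines).  In this sketch a smaller q
    means more heterogeneity; the paper's exact parameterization may differ.
    """
    rng = np.random.default_rng(seed)
    machine_indices = [[] for _ in range(num_machines)]
    for idx, c in enumerate(np.asarray(labels)):
        if rng.random() < q:
            m = int(rng.integers(num_machines))  # shared, near-i.i.d. portion
        else:
            m = int(c) % num_machines            # label-skewed portion
        machine_indices[m].append(idx)
    return [np.array(ix) for ix in machine_indices]

# Usage (hypothetical dataset object): partition the 50,000 CIFAR-10
# training labels across M = 10 machines with q = 0.1.
# parts = heterogeneous_partition(cifar10_train.targets, num_machines=10, q=0.1)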
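
Finally, a minimal sketch of a communication-efficient local-update loop that wires together the reported hyperparameters (M = 10 machines, mini-batch size b = 16, momentum β = 0.3, and the step-size grid). This is not the authors' Algorithm 1 (CE-LSGD): it uses a plain exponential-moving-average momentum in place of the paper's momentum-based variance-reduction update, and the gradient-oracle interface, the number of local steps K, and the number of rounds are assumptions.

import numpy as np

# Hyperparameters reported in the experiment setup.
M = 10                 # number of machines
K = 50                 # local steps per round (value not stated in the excerpt)
b = 16                 # mini-batch size per oracle query
beta = 0.3             # momentum parameter for "our method"
step_size_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]

def local_momentum_sgd(grad_oracle, x0, eta, rounds=100):
    """Sketch of a communication-efficient local-update loop.

    grad_oracle(m, x, b) is an assumed interface returning a stochastic
    gradient at x from a size-b mini-batch held by machine m.  Each machine
    runs K local momentum steps (Kb oracle queries), then the server
    averages the local iterates -- one communication per round.  The
    paper's variance-reduction correction is intentionally not reproduced.
    """
    x = np.asarray(x0, dtype=float)
    momentum = [np.zeros_like(x) for _ in range(M)]
    for _ in range(rounds):
        local_iterates = []
        for m in range(M):
            y = x.copy()
            for _ in range(K):
                g = grad_oracle(m, y, b)
                momentum[m] = beta * g + (1.0 - beta) * momentum[m]
                y = y - eta * momentum[m]
            local_iterates.append(y)
        x = np.mean(local_iterates, axis=0)   # single communication per round
    return x

# Step-size tuning as reported: pick the eta from step_size_grid that
# attains the best loss, e.g.
# best_eta = min(step_size_grid,
#                key=lambda eta: loss(local_momentum_sgd(grad_oracle, x0, eta)))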