Towards Optimal Communication Complexity in Distributed Non-Convex Optimization
Authors: Kumar Kshitij Patel, Lingxiao Wang, Blake E. Woodworth, Brian Bullins, Nati Srebro
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose and analyze a new algorithm that improves existing methods by requiring fewer and lighter variance reduction operations. We also present lower bounds, showing our algorithm is either optimal or almost optimal in most settings. Numerical experiments demonstrate the superior performance of our algorithm. |
| Researcher Affiliation | Academia | Kumar Kshitij Patel (TTIC, kkpatel@ttic.edu); Lingxiao Wang (TTIC, lingxw@ttic.edu); Blake Woodworth (SIERRA, INRIA, blakewoodworth@gmail.com); Brian Bullins (Purdue University, bbullins@purdue.edu); Nati Srebro (TTIC, nati@ttic.edu) |
| Pseudocode | Yes | Algorithm 1 Communication Efficient Local Stochastic Gradient Descent (CE-LSGD) |
| Open Source Code | No | The paper's review checklist states that code is included or available, but the provided text contains no specific link or explicit statement about releasing the code for the described methodology. |
| Open Datasets | Yes | We evaluate the performance of our method by optimizing a two-layer fully connected network for multi-class classification on the CIFAR-10 [35] data-set. |
| Dataset Splits | No | The paper describes how heterogeneous client data-sets were synthetically generated from CIFAR-10, but it does not specify explicit training, validation, and test splits by percentage or sample count. |
| Hardware Specification | No | The paper states in its review checklist that hardware details are provided in the appendix, but the provided text of the paper does not include the appendix or any specific hardware specifications (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper mentions implementing the network but does not specify any software libraries or frameworks with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x). |
| Experiment Setup | Yes | We use M = 10 machines, and synthetically generate heterogeneous data-sets (see Section 4) with q = 0.1. All oracle queries use a mini-batch of size b = 16, i.e., each machine has Kb oracle queries between two communication rounds. We note that our method has a faster convergence in all the settings, which highlights its communication efficiency. Fixed step-sizes η for both methods were tuned over {0.001, 0.005, 0.01, 0.05, 0.1, 0.5} (to obtain the best loss) following [22]; our method sets the momentum β = 0.3 and b^our_max = K, while b^BVR_max = 5000 according to [22]. |
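
As a rough illustration of the setup quoted in the Experiment Setup row, the sketch below encodes the reported hyperparameters (M = 10 machines, q = 0.1, b = 16, β = 0.3, the b_max values) and the step-size grid search. The configuration keys, the `run_trial` stub, and the mock loss are hypothetical placeholders for illustration only, not the authors' code.

```python
# Hypothetical sketch of the reported experimental setup; the configuration
# keys and the run_trial stub are illustrative placeholders, not the authors' code.
import random

config = {
    "num_machines": 10,       # M = 10 workers
    "heterogeneity_q": 0.1,   # q = 0.1 controls the synthetic heterogeneity
    "batch_size": 16,         # mini-batch b = 16 per oracle query
    "momentum_beta": 0.3,     # momentum beta = 0.3 for CE-LSGD, as reported
    "b_max_ours": "K",        # capped at K, the number of local steps (not given numerically)
    "b_max_bvr": 5000,        # large-batch size for the BVR-L-SGD baseline [22]
}

step_size_grid = [0.001, 0.005, 0.01, 0.05, 0.1, 0.5]


def run_trial(method: str, eta: float, cfg: dict) -> float:
    """Stand-in for one training run; returns a mock final training loss."""
    random.seed(hash((method, eta)) % 2**32)
    return random.random()  # replace with the actual CIFAR-10 training loop


# Tune the fixed step-size eta per method by grid search on the final loss,
# mirroring the "tuned over {0.001, ..., 0.5} to obtain the best loss" description.
best_eta = {}
for method in ("CE-LSGD", "BVR-L-SGD"):
    losses = {eta: run_trial(method, eta, config) for eta in step_size_grid}
    best_eta[method] = min(losses, key=losses.get)
print(best_eta)
```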