Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations
Authors: Debraj Basu, Deepesh Data, Can Karakus, Suhas Diggavi
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use Qsparse-local-SGD to train ResNet-50 on ImageNet, and show that it results in significant savings over the state-of-the-art, in the number of bits transmitted to reach target accuracy. |
| Researcher Affiliation | Collaboration | Debraj Basu (Adobe Inc., dbasu@adobe.com); Deepesh Data (UCLA, deepeshdata@ucla.edu); Can Karakus (Amazon Inc., cakarak@amazon.com); Suhas Diggavi (UCLA, suhasdiggavi@ucla.edu) |
| Pseudocode | Yes | Algorithm 1 Qsparse-local-SGD (a minimal sketch of one round appears after this table) |
| Open Source Code | Yes | Our implementation is available at https://github.com/karakusc/horovod/tree/qsparselocal. |
| Open Datasets | Yes | We implement Qsparse-local-SGD for ResNet-50 using the ImageNet dataset, and show that we achieve target accuracies... We also perform analogous experiments on the MNIST [19] handwritten digits dataset for softmax regression with a standard ℓ2 regularizer... |
| Dataset Splits | No | The paper mentions using ImageNet and MNIST datasets and discusses training and testing, but does not provide specific details on how these datasets were split into training, validation, or test sets (e.g., percentages or sample counts for each split). |
| Hardware Specification | Yes | We train ResNet-50 [13] (which has d = 25,610,216 parameters) on the ImageNet dataset, using 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'Horovod framework [28]' but does not specify its version number or any other software dependencies with version numbers. |
| Experiment Setup | Yes | We use a learning rate schedule consisting of 5 epochs of linear warmup, followed by a piecewise decay of 0.1 at epochs 30, 60 and 80, with a batch size of 256 per GPU. For experiments, we focus on SGD with momentum of 0.9, applied on the local iterations of the workers. (A sketch of this schedule also follows the table.) |
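To give a concrete picture of the "Algorithm 1 Qsparse-local-SGD" pseudocode flagged above, here is a minimal NumPy sketch of one worker's synchronization round: H local SGD steps followed by an error-compensated, quantized top-k update of the model difference. The function names (`topk`, `quantize_sign`, `local_round`), the scaled-sign quantizer, and all hyperparameter values are illustrative assumptions, not the paper's exact operators or settings.

```python
import numpy as np

def topk(v, k):
    """Top-k sparsification: keep the k largest-magnitude coordinates."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def quantize_sign(v):
    """Scaled sign quantization over the nonzero support (one illustrative
    choice of quantizer; the paper's framework admits other quantizers)."""
    nz = v != 0
    if not nz.any():
        return v.copy()
    q = np.zeros_like(v)
    q[nz] = np.abs(v[nz]).mean() * np.sign(v[nz])
    return q

def local_round(x_sync, grad_fn, lr, H, k, memory):
    """One worker's round: H local SGD steps, then an error-compensated,
    quantized top-k compression of the accumulated model difference."""
    x = x_sync.copy()
    for _ in range(H):
        x -= lr * grad_fn(x)
    residual = memory + (x - x_sync)       # add back past compression error
    update = quantize_sign(topk(residual, k))
    memory = residual - update             # error feedback for the next round
    return update, memory

# Toy usage: one worker minimizing ||x||^2 / 2 with noisy gradients.
rng = np.random.default_rng(0)
d, k, H, lr = 10, 3, 4, 0.1
x_sync, memory = np.ones(d), np.zeros(d)
grad_fn = lambda x: x + 0.01 * rng.standard_normal(d)
for _ in range(20):
    update, memory = local_round(x_sync, grad_fn, lr, H, k, memory)
    x_sync = x_sync + update  # with W workers, average the W updates instead
```

The error-feedback memory is what lets the compressed updates remain unbiased in aggregate: whatever the quantized top-k operator discards in one round is carried into the next.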
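The experiment-setup row quotes a warmup-plus-piecewise-decay schedule. A minimal sketch of that schedule follows, assuming a base learning rate of 0.1 (the quote does not state one) and the hypothetical function name `lr_schedule`.

```python
def lr_schedule(epoch, base_lr=0.1, warmup_epochs=5,
                milestones=(30, 60, 80), decay=0.1):
    """Per-epoch learning rate: linear warmup for the first 5 epochs,
    then a 0.1x piecewise decay at epochs 30, 60 and 80. The base rate
    of 0.1 is an assumed placeholder, not a value from the paper."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr * decay ** sum(epoch >= m for m in milestones)
```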