Rethinking gradient sparsification as total error minimization
Authors: Atal Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k. Code is available at https://github.com/sands-lab/rethinking-sparsification. Experiments (§6). We conduct diverse experiments on both strongly convex and non-convex (for DNNs) loss functions to substantiate our claims. Our DNN experiments include computer vision, language modeling, and recommendation tasks, and our strongly convex experiment is on logistic regression. |
| Researcher Affiliation | Academia | Atal Narayan Sahu (KAUST), Aritra Dutta (KAUST), Ahmed M. Abdelmoniem (KAUST), Trambak Banerjee (University of Kansas), Marco Canini (KAUST), Panos Kalnis (KAUST) |
| Pseudocode | Yes | Algorithm 1: Distributed EF SGD (a minimal sketch of the error-feedback update follows the table) |
| Open Source Code | Yes | Code is available at https://github.com/sands-lab/rethinking-sparsification. |
| Open Datasets | Yes | Our diverse experiments on various DNNs and a logistic regression model... logistic regression model on the gisette LIBSVM dataset [14]... ResNet-18 on CIFAR-100... ResNet-50 on ImageNet... ResNet-18 on CIFAR-10... LSTM on Wikitext... NCF on Movielens-20M... |
| Dataset Splits | No | The paper mentions training, testing, and sometimes validation in the context of its experiments, but it does not specify explicit percentages or methods for creating dataset splits, such as an '80/10/10 split' or specific sample counts. |
| Hardware Specification | No | All experiments were run on an 8-GPU cluster, using Allgather as the communication primitive. The paper does not provide specific GPU models (e.g., NVIDIA A100), CPU models, or detailed cloud/cluster resource specifications. |
| Software Dependencies | No | For instance, PyTorch uses Radix select algorithm [5]... PyTorch. https://pytorch.org/. The paper mentions PyTorch but does not provide a specific version number, nor does it list other software dependencies with their versions. |
| Experiment Setup | Yes | We train for 10 epochs and set k = 0.17% for Top-k, and λ = 4.2 for hard-threshold. We use different optimizers: vanilla SGD, SGD with Nesterov momentum, and ADAM [34]. In Figure 3, we introspect a run with average density of 0.06% from Figure 2a. In Figure 3a, while hard-threshold converges to an accuracy of 93.9%, Top-k achieves 91.1% accuracy. At the same time, in Figure 3b, we observe large error-accumulation in the initial 1,200 iterations for Top-k. Consequently, hard-threshold has a significantly lower total-error than Top-k, and therefore has better convergence. This observation about large error accumulation for Top-k is consistent across all our benchmarks (see §C.2). The λ in Table 1 is derived from simplifying this formula. (Both sparsifiers and the error-feedback update are sketched after this table.) |
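
The experiment-setup row contrasts Top-k (fixed per-step density; the quoted run uses k = 0.17%) with the hard-threshold sparsifier (fixed magnitude cutoff; λ = 4.2 in the quoted run). The sketch below is a minimal NumPy illustration of the two compressors; the function names and the toy values of `k`, `lam`, and `v` are placeholders chosen to make the difference visible, not settings from the paper.

```python
import numpy as np

def topk_sparsifier(v, k):
    """Top-k: always sends exactly k coordinates, the largest in magnitude."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def hard_threshold_sparsifier(v, lam):
    """Hard-threshold: sends every coordinate with |v_i| >= lam.

    The number of sent coordinates varies per step, so the density adapts
    to the gradient scale instead of being fixed in advance.
    """
    return np.where(np.abs(v) >= lam, v, 0.0)

# The same vector under both sparsifiers (toy values):
v = np.array([0.1, -3.0, 0.02, 5.0, -0.5])
print(topk_sparsifier(v, k=2))                # keeps -3.0 and 5.0
print(hard_threshold_sparsifier(v, lam=1.0))  # also keeps -3.0 and 5.0 here
```

Top-k guarantees a fixed communication volume per step, while hard-threshold fixes the magnitude of what may be dropped, which is the distinction the paper's total-error argument builds on.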
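
The pseudocode row refers to Algorithm 1, Distributed EF SGD. Below is a minimal single-process sketch of the generic error-feedback pattern that such an algorithm follows, with the workers simulated in a loop; the function `ef_sgd_step`, its arguments, and the hard-threshold compressor in the usage lines are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ef_sgd_step(x, grads, errors, lr, compressor):
    """One synchronous step of error-feedback (EF) SGD with sparsified updates.

    x          -- current model parameters, shared by all workers
    grads      -- list of per-worker stochastic gradients
    errors     -- list of per-worker error-accumulation vectors (updated in place)
    compressor -- sparsifier applied to each error-corrected local update
    """
    deltas = []
    for i, g in enumerate(grads):
        p = errors[i] + lr * g      # add back what was dropped previously
        delta = compressor(p)       # sparsify: only delta is communicated
        errors[i] = p - delta       # accumulate the part that was not sent
        deltas.append(delta)
    # Allgather/average step: every worker applies the mean sparse update.
    return x - np.mean(deltas, axis=0)

# Usage with a hard-threshold compressor (all values are placeholders):
rng = np.random.default_rng(0)
d, workers, lam = 10, 4, 0.05
x = rng.standard_normal(d)
errors = [np.zeros(d) for _ in range(workers)]
for _ in range(3):
    grads = [rng.standard_normal(d) for _ in range(workers)]
    x = ef_sgd_step(x, grads, errors, lr=0.1,
                    compressor=lambda p: np.where(np.abs(p) >= lam, p, 0.0))
```

In this formulation, the running magnitude of the `errors[i]` vectors corresponds to the error accumulation the quoted excerpt describes: the paper reports that, at the same average density, the hard-threshold sparsifier keeps this total error lower than Top-k and therefore converges better.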