Rethinking gradient sparsification as total error minimization

Authors: Atal Narayan Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k. Code is available at https://github.com/sands-lab/rethinking-sparsification. Experiments (§6). We conduct diverse experiments on both strongly convex and non-convex (for DNNs) loss functions to substantiate our claims. Our DNN experiments include computer vision, language modeling, and recommendation tasks, and our strongly convex experiment is on logistic regression. (A minimal sketch of the two sparsifiers being compared follows this table.)
Researcher Affiliation | Academia | Atal Narayan Sahu (KAUST), Aritra Dutta (KAUST), Ahmed M. Abdelmoniem (KAUST), Trambak Banerjee (University of Kansas), Marco Canini (KAUST), Panos Kalnis (KAUST)
Pseudocode | Yes | Algorithm 1: Distributed EF-SGD. (A hedged sketch of the error-feedback step also follows this table.)
Open Source Code | Yes | Code is available at https://github.com/sands-lab/rethinking-sparsification.
Open Datasets | Yes | Our diverse experiments on various DNNs and a logistic regression model... logistic regression model on the gisette LIBSVM dataset [14]... ResNet-18 on CIFAR-100... ResNet-50 on ImageNet... ResNet-18 on CIFAR-10... LSTM on WikiText... NCF on MovieLens-20M...
Dataset Splits | No | The paper mentions training, testing, and sometimes validation in the context of its experiments, but it does not specify explicit percentages or methods for creating dataset splits, such as an '80/10/10 split' or specific sample counts.
Hardware Specification | No | All experiments were run on an 8-GPU cluster, using Allgather as the communication primitive. The paper does not provide specific GPU models (e.g., NVIDIA A100), CPU models, or detailed cloud/cluster resource specifications.
Software Dependencies | No | For instance, PyTorch uses the Radix select algorithm [5]... PyTorch. https://pytorch.org/. The paper mentions PyTorch but does not provide a specific version number, nor does it list other software dependencies with their versions.
Experiment Setup | Yes | We train for 10 epochs and set k = 0.17% for Top-k, and λ = 4.2 for hard-threshold. We use different optimizers: vanilla SGD, SGD with Nesterov momentum, and ADAM [34]. In Figure 3, we introspect a run with average density of 0.06% from Figure 2a. In Figure 3a, while hard-threshold converges to an accuracy of 93.9%, Top-k achieves 91.1% accuracy. At the same time, in Figure 3b, we observe large error accumulation in the initial 1,200 iterations for Top-k. Consequently, hard-threshold has a significantly lower total error than Top-k, and therefore has better convergence. This observation about large error accumulation for Top-k is consistent across all our benchmarks (see C.2). The λ in Table 1 is derived from simplifying this formula.
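
The hard-threshold vs. Top-k comparison quoted above rests on two simple operators: Top-k keeps a fixed number of the largest-magnitude gradient coordinates, while hard-threshold keeps every coordinate whose magnitude exceeds a fixed λ, so its per-step density can vary. A minimal PyTorch sketch of both, assuming flattened gradient tensors; the function names are ours, not the implementation from the linked repository:

```python
import torch

def topk_sparsify(grad: torch.Tensor, k: int) -> torch.Tensor:
    """Top-k sparsifier: keep the k largest-magnitude entries, zero the rest.

    The paper reports k as a fraction of the gradient size (e.g. 0.17%);
    here k is the absolute number of retained entries.
    """
    flat = grad.flatten()
    _, idx = torch.topk(flat.abs(), k)
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(grad)

def hard_threshold_sparsify(grad: torch.Tensor, lam: float) -> torch.Tensor:
    """Hard-threshold sparsifier: keep entries with |g_i| >= lam, zero the rest."""
    return torch.where(grad.abs() >= lam, grad, torch.zeros_like(grad))
```

In the runs quoted in the Experiment Setup row, k is set to 0.17% of the gradient entries for Top-k and λ = 4.2 for hard-threshold.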
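
Algorithm 1 (Distributed EF-SGD) pairs a sparsifier with per-worker error feedback: whatever is not transmitted is kept in a local error memory and added back to the next gradient. The sketch below is a hedged, single-process view of that step, reusing the sparsifiers above; the Allgather exchange across the paper's 8-GPU cluster is only indicated in a comment, and the names (ef_sgd_step, memory) are ours:

```python
import torch

def ef_sgd_step(params, grads, memory, lr, sparsify):
    """One error-feedback SGD step, seen from a single worker (communication omitted)."""
    for p, g, e in zip(params, grads, memory):
        corrected = g + e                # add the error carried over from earlier steps
        sparse = sparsify(corrected)     # e.g. hard_threshold_sparsify(corrected, lam=4.2)
        e.copy_(corrected - sparse)      # remember what was not transmitted
        # In the distributed setting, `sparse` would be exchanged via Allgather
        # and averaged across workers before the parameter update below.
        p.data.add_(sparse, alpha=-lr)
```

Here memory is a list of zero-initialized tensors shaped like the parameters; its growth corresponds to the error accumulation that the Experiment Setup row reports being large for Top-k in the first 1,200 iterations.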