Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Rethinking gradient sparsification as total error minimization
Authors: Atal Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k. Code is available at https://github.com/sands-lab/rethinking-sparsification. Experiments (§6). We conduct diverse experiments on both strongly convex and non-convex (for DNNs) loss functions to substantiate our claims. Our DNN experiments include computer vision, language modeling, and recommendation tasks, and our strongly convex experiment is on logistic regression. |
| Researcher Affiliation | Academia | Atal Narayan Sahu (KAUST), Aritra Dutta (KAUST), Ahmed M. Abdelmoniem (KAUST), Trambak Banerjee (University of Kansas), Marco Canini (KAUST), Panos Kalnis (KAUST) |
| Pseudocode | Yes | Algorithm 1: Distributed EF SGD |
| Open Source Code | Yes | Code is available at https://github.com/sands-lab/rethinking-sparsification. |
| Open Datasets | Yes | Our diverse experiments on various DNNs and a logistic regression model... logistic regression model on the gisette LIBSVM dataset [14]... ResNet-18 on CIFAR-100... ResNet-50 on ImageNet... ResNet-18 on CIFAR-10... LSTM on Wikitext... NCF on Movielens-20M... |
| Dataset Splits | No | The paper mentions training, testing, and sometimes validation in context of experiments, but it does not specify explicit percentages or methods for creating dataset splits, such as '80/10/10 split' or specific sample counts. |
| Hardware Specification | No | All experiments were run on an 8-GPU cluster, using Allgather as the communication primitive. The paper does not provide specific GPU models (e.g., NVIDIA A100), CPU models, or detailed cloud/cluster resource specifications. |
| Software Dependencies | No | For instance, PyTorch uses Radix select algorithm [5]... PyTorch. https://pytorch.org/. The paper mentions PyTorch but does not provide a specific version number, nor does it list other software dependencies with their versions. |
| Experiment Setup | Yes | We train for 10 epochs and set k = 0.17% for Top-k, and λ = 4.2 for hard-threshold. We use different optimizers: vanilla SGD, SGD with Nesterov momentum, and ADAM [34]. In Figure 3, we introspect a run with average density of 0.06% from Figure 2a. In Figure 3a, while hard-threshold converges to an accuracy of 93.9%, Top-k achieves 91.1% accuracy. At the same time, in Figure 3b, we observe large error-accumulation in the initial 1,200 iterations for Top-k. Consequently, hard-threshold has a significantly lower total-error than Top-k, and therefore has better convergence. This observation about large error accumulation for Top-k is consistent across all our benchmarks (see C.2). The λ in Table 1 is derived from simplifying this formula. |
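
The table refers to two sparsifiers (Top-k and hard-threshold) and to error-feedback (EF) SGD. As a rough illustration of how these pieces relate, here is a minimal pure-Python sketch for a single worker. It is not taken from the paper's repository: the function names `top_k`, `hard_threshold`, and `ef_step` are hypothetical, the step-size scaling and the multi-worker aggregation of Algorithm 1 are omitted, and keeping entries with magnitude exactly equal to λ is an assumption.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def top_k(vec: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    if k <= 0:
        return [0.0] * len(vec)
    # Indices ordered by decreasing magnitude; sorted() is stable,
    # so ties are broken in favor of the lower index.
    order = sorted(range(len(vec)), key=lambda i: -abs(vec[i]))
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def hard_threshold(vec: Vector, lam: float) -> Vector:
    """Keep entries with magnitude >= lam (boundary choice is an assumption)."""
    return [v if abs(v) >= lam else 0.0 for v in vec]


def ef_step(
    grad: Vector, error: Vector, sparsify: Callable[[Vector], Vector]
) -> Tuple[Vector, Vector]:
    """One error-feedback step: sparsify (grad + carried error), keep the residual."""
    corrected = [g + e for g, e in zip(grad, error)]
    sent = sparsify(corrected)
    new_error = [c - s for c, s in zip(corrected, sent)]
    return sent, new_error


# Illustrative values only, not data from the paper.
grad = [0.5, -3.0, 0.1, 2.0]
sent, error = ef_step(grad, [0.0] * 4, lambda v: hard_threshold(v, 1.0))
print(sent)   # entries below the threshold are withheld
print(error)  # withheld entries are carried to the next step
```

The sketch shows the structural difference between the two: Top-k sends a fixed number of entries per step regardless of their size, while hard-threshold sends a variable number that depends on how many entries reach λ. In both cases the unsent residual accumulates in `error`, which is the quantity the "total error" in the paper's title refers to.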