Rethinking gradient sparsification as total error minimization
Authors: Atal Sahu, Aritra Dutta, Ahmed M. Abdelmoniem, Trambak Banerjee, Marco Canini, Panos Kalnis
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our diverse experiments on various DNNs and a logistic regression model demonstrate that the hard-threshold sparsifier is more communication-efficient than Top-k. Code is available at https://github.com/sands-lab/rethinking-sparsification. Experiments (§6). We conduct diverse experiments on both strongly convex and non-convex (for DNNs) loss functions to substantiate our claims. Our DNN experiments include computer vision, language modeling, and recommendation tasks, and our strongly convex experiment is on logistic regression. |
| Researcher Affiliation | Academia | Atal Narayan Sahu (KAUST), Aritra Dutta (KAUST), Ahmed M. Abdelmoniem (KAUST), Trambak Banerjee (University of Kansas), Marco Canini (KAUST), Panos Kalnis (KAUST) |
| Pseudocode | Yes | Algorithm 1: Distributed EF SGD (a minimal sketch of the error-feedback update follows the table) |
| Open Source Code | Yes | Code is available at https://github.com/sands-lab/rethinking-sparsification. |
| Open Datasets | Yes | Our diverse experiments on various DNNs and a logistic regression model... logistic regression model on the gisette LIBSVM dataset [14]... ResNet-18 on CIFAR-100... ResNet-50 on ImageNet... ResNet-18 on CIFAR-10... LSTM on Wikitext... NCF on Movielens-20M... |
| Dataset Splits | No | The paper mentions training, testing, and sometimes validation in the context of its experiments, but it does not specify explicit percentages or methods for creating dataset splits, such as an '80/10/10 split' or specific sample counts. |
| Hardware Specification | No | All experiments were run on an 8-GPU cluster, using Allgather as the communication primitive. The paper does not provide specific GPU models (e.g., NVIDIA A100), CPU models, or detailed cloud/cluster resource specifications. |
| Software Dependencies | No | For instance, PyTorch uses Radix select algorithm [5]... PyTorch. https://pytorch.org/. The paper mentions PyTorch but does not provide a specific version number, nor does it list other software dependencies with their versions. |
| Experiment Setup | Yes | We train for 10 epochs and set k = 0.17% for Top-k, and λ = 4.2 for hard-threshold. We use different optimizers: vanilla SGD, SGD with Nesterov momentum, and ADAM [34]. In Figure 3, we introspect a run with average density of 0.06% from Figure 2a. In Figure 3a, while hard-threshold converges to an accuracy of 93.9%, Top-k achieves 91.1% accuracy. At the same time, in Figure 3b, we observe large error-accumulation in the initial 1,200 iterations for Top-k. Consequently, hard-threshold has a significantly lower total-error than Top-k, and therefore has better convergence. This observation about large error accumulation for Top-k is consistent across all our benchmarks (see §C.2). The λ in Table 1 is derived from simplifying this formula. (Both sparsifiers and the error-feedback update are sketched after this table.) |
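
The experiment-setup row contrasts Top-k (fixed per-step density; the quoted run uses k = 0.17%) with the hard-threshold sparsifier (fixed magnitude cutoff; λ = 4.2 in the quoted run). The sketch below is a minimal NumPy illustration of the two compressors; the function names and the toy values of `k`, `lam`, and `v` are placeholders chosen to make the difference visible, not settings from the paper.

```python
import numpy as np

def topk_sparsifier(v, k):
    """Top-k: always sends exactly k coordinates, the largest in magnitude."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def hard_threshold_sparsifier(v, lam):
    """Hard-threshold: sends every coordinate with |v_i| >= lam.

    The number of sent coordinates varies per step, so the density adapts
    to the gradient scale instead of being fixed in advance.
    """
    return np.where(np.abs(v) >= lam, v, 0.0)

# The same vector under both sparsifiers (toy values):
v = np.array([0.1, -3.0, 0.02, 5.0, -0.5])
print(topk_sparsifier(v, k=2))                # keeps -3.0 and 5.0
print(hard_threshold_sparsifier(v, lam=1.0))  # also keeps -3.0 and 5.0 here
```

Top-k guarantees a fixed communication volume per step, while hard-threshold fixes the magnitude of what may be dropped, which is the distinction the paper's total-error argument builds on.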
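
The pseudocode row refers to Algorithm 1, Distributed EF SGD. Below is a minimal single-process sketch of the generic error-feedback pattern that such an algorithm follows, with the workers simulated in a loop; the function `ef_sgd_step`, its arguments, and the hard-threshold compressor in the usage lines are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def ef_sgd_step(x, grads, errors, lr, compressor):
    """One synchronous step of error-feedback (EF) SGD with sparsified updates.

    x          -- current model parameters, shared by all workers
    grads      -- list of per-worker stochastic gradients
    errors     -- list of per-worker error-accumulation vectors (updated in place)
    compressor -- sparsifier applied to each error-corrected local update
    """
    deltas = []
    for i, g in enumerate(grads):
        p = errors[i] + lr * g      # add back what was dropped previously
        delta = compressor(p)       # sparsify: only delta is communicated
        errors[i] = p - delta       # accumulate the part that was not sent
        deltas.append(delta)
    # Allgather/average step: every worker applies the mean sparse update.
    return x - np.mean(deltas, axis=0)

# Usage with a hard-threshold compressor (all values are placeholders):
rng = np.random.default_rng(0)
d, workers, lam = 10, 4, 0.05
x = rng.standard_normal(d)
errors = [np.zeros(d) for _ in range(workers)]
for _ in range(3):
    grads = [rng.standard_normal(d) for _ in range(workers)]
    x = ef_sgd_step(x, grads, errors, lr=0.1,
                    compressor=lambda p: np.where(np.abs(p) >= lam, p, 0.0))
```

In this formulation, the running magnitude of the `errors[i]` vectors corresponds to the error accumulation the quoted excerpt describes: the paper reports that, at the same average density, the hard-threshold sparsifier keeps this total error lower than Top-k and therefore converges better.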