The Convergence of Sparsified Gradient Methods

Authors: Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, Cédric Renggli

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate Assumption 1 experimentally on a number of different learning tasks in Section 6 (see also Figure 1).
Researcher Affiliation | Academia | Dan Alistarh (IST Austria, dan.alistarh@ist.ac.at); Torsten Hoefler (ETH Zurich, htor@inf.ethz.ch); Mikael Johansson (KTH, mikaelj@kth.se); Sarit Khirirat (KTH, sarit@kth.se); Nikola Konstantinov (IST Austria, nikola.konstantinov@ist.ac.at); Cédric Renggli (ETH Zurich, cedric.renggli@inf.ethz.ch)
Pseudocode | Yes | Algorithm 1 Parallel TopK SGD at a node p.
Input: stochastic gradient oracle G^p(·) at node p; value K; learning rate α.
Initialize v_0 = ε^p_0 = 0.
for each step t ≥ 1 do
  acc^p_t ← ε^p_{t-1} + α G^p(v_{t-1})  {accumulate error into a locally generated gradient}
  ε^p_t ← acc^p_t - TopK(acc^p_t)  {update the error}
  Broadcast(TopK(acc^p_t), SUM)  {broadcast to all nodes and receive from all nodes}
  g_t ← (1/P) Σ_{q=1}^{P} TopK(acc^q_t)  {average the received (sparse) gradients}
  v_t ← v_{t-1} - g_t  {apply the update}
end for
(A runnable sketch of this update appears after the table.)
Open Source Code | No | The paper provides a link to an arXiv preprint of the full version, but no explicit statement or link to source code for the described methodology.
Open Datasets | Yes | We validate Assumption 1 experimentally on a number of different learning tasks in Section 6 (see also Figure 1). Specifically, we sample gradients at different epochs during the training process, and bound the constant by comparing the left- and right-hand sides of Equation (8). The assumption appears to hold with relatively low, stable values of the constant. We note that RCV1 is relatively sparse (average density ≈ 10%), while gradients in the other two settings are fully dense. (Figure 1 panels: (a) empirical logistic/RCV1; (b) empirical synthetic; (c) empirical ResNet110.)
Dataset Splits | No | The paper mentions using RCV1, synthetic data, and ResNet110 on CIFAR-10, but does not provide specific details on training, validation, or test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) within the provided text.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software such as TensorFlow and MXNet in the related work, but does not specify any ancillary software dependencies with version numbers for its own experiments.
Experiment Setup | No | The paper states, "Exact descriptions of the experimental setup are given in the full version of the paper [5]", indicating these details are not in the provided text. It does not provide concrete hyperparameter values or detailed training configurations.
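
To make the pseudocode row above concrete, the following is a minimal single-process sketch of the TopK error-feedback update from Algorithm 1. It is a NumPy illustration under stated assumptions, not the authors' implementation: the least-squares objective, the node count P, the batch size, and the function names (top_k, stochastic_gradient, topk_sgd) are hypothetical choices, and the Broadcast(..., SUM) step is simulated by summing the sparsified accumulators in memory.

# Minimal sketch of Algorithm 1 (parallel TopK SGD with error feedback).
# Assumptions: a toy least-squares objective, P simulated nodes, and an
# in-memory sum standing in for Broadcast(..., SUM). Not the authors' code.
import numpy as np


def top_k(x, k):
    """Keep the k largest-magnitude entries of x, zero out the rest."""
    if k >= x.size:
        return x.copy()
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out


def stochastic_gradient(v, A, b, batch=8, rng=None):
    """Mini-batch gradient of the least-squares loss 0.5 * ||A v - b||^2 / n."""
    if rng is None:
        rng = np.random.default_rng()
    rows = rng.choice(A.shape[0], size=batch, replace=False)
    Ab, bb = A[rows], b[rows]
    return Ab.T @ (Ab @ v - bb) / batch


def topk_sgd(A, b, P=4, K=10, lr=0.05, steps=500, seed=0):
    rng = np.random.default_rng(seed)
    d = A.shape[1]
    v = np.zeros(d)                          # shared model v_t
    eps = [np.zeros(d) for _ in range(P)]    # per-node error memory eps^p_t
    for _ in range(steps):
        sparsified = []
        for p in range(P):
            # accumulate error into a locally generated gradient
            acc = eps[p] + lr * stochastic_gradient(v, A, b, rng=rng)
            s = top_k(acc, K)
            eps[p] = acc - s                 # update the error
            sparsified.append(s)             # "broadcast" TopK(acc^p_t)
        g = sum(sparsified) / P              # average the received sparse gradients
        v = v - g                            # apply the update
    return v


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((256, 50))
    x_true = rng.standard_normal(50)
    b = A @ x_true + 0.01 * rng.standard_normal(256)
    v = topk_sgd(A, b)
    print("relative error:", np.linalg.norm(v - x_true) / np.linalg.norm(x_true))

The detail worth noticing is the per-node error memory eps[p]: coordinates that TopK drops are not discarded but carried into the next step's accumulator, which is the mechanism the paper's convergence analysis builds on.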