ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training

Authors: Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Wei Zhang, Kailash Gopalakrishnan

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 Experimental Results: We apply ScaleCom to three major applications: vision (ImageNet, CIFAR10), language (WMT14 En-De), and speech (SWB300).
Researcher Affiliation | Industry | Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Wei Zhang, Kailash Gopalakrishnan; IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA; {cchen, cuix, xsun, nwang, viji, weiz, kailash}@us.ibm.com, {jiamin.ni, songtao, pin-yu.chen, swagath.venkataramani}@ibm.com
Pseudocode | Yes | Algorithm 1 (ScaleCom: Scalable Sparsified Gradient Compression)
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository.
Open Datasets | Yes | We apply ScaleCom to three major applications: vision (ImageNet, CIFAR10), language (WMT14 En-De), and speech (SWB300).
Dataset Splits | No | The paper mentions various datasets (ImageNet, CIFAR10, WMT14 En-De, SWB300) and batch sizes, but does not explicitly provide specific percentages, sample counts, or a detailed methodology for dataset splits (e.g., train/validation/test) to enable reproduction of the data partitioning.
Hardware Specification | Yes | Experiments are run on IBM POWER System AC922 systems using implementations in PyTorch.
Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or any other software dependencies with their versions.
Experiment Setup | Yes | We use 1-5 warm-up epochs (<10% of total training epochs) for compression. A conservative engineering guidance is proposed for compression rate settings in each layer based upon the ratio FLOPs/gradient: 25X for ratio in the range [196, ∞); 50X for [128, 196]; and 400X for (0, 128]. ... this guidance is based on the per-worker mini-batch size, 32 for vision and speech and 4.5k for language. ... In these experiments, we adopt hyper-parameter settings from [1][3][5] (including learning rates and momentum)... we set β=1 in the low-pass filter... Once the proposed low-pass filter is applied (β=0.1), ScaleCom achieves almost identical test accuracies.
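To make the quoted setup more concrete, the sketch below illustrates how a per-layer compression rate could be derived from the FLOPs/gradient guidance and how a β low-pass filter could be combined with top-k sparsification and error feedback. This is a minimal PyTorch-style sketch under stated assumptions: the function names (choose_compression_rate, compress_layer_gradient) are hypothetical, and the EMA placement of the β filter is a simplification for illustration, not the exact update rule of the paper's Algorithm 1.

```python
import torch

def choose_compression_rate(flops_to_gradient_ratio: float) -> int:
    # Per-layer compression rate from the quoted FLOPs/gradient guidance.
    if flops_to_gradient_ratio >= 196:
        return 25    # 25x for compute-heavy layers
    if flops_to_gradient_ratio >= 128:
        return 50    # 50x for the middle range [128, 196]
    return 400       # 400x for gradient-heavy layers, ratio in (0, 128]

def compress_layer_gradient(grad, memory, filtered, compression_rate, beta=0.1):
    # EMA-style low-pass filter on the incoming gradient; beta=1 disables it.
    # (The placement of the filter here is an illustrative assumption, not the
    # paper's exact Algorithm 1 update.)
    filtered = beta * grad + (1.0 - beta) * filtered
    # Error feedback: fold the previously unsent residual back in.
    acc = memory + filtered
    # Keep only the largest-magnitude 1/compression_rate fraction of entries.
    k = max(1, acc.numel() // compression_rate)
    _, idx = torch.topk(acc.abs().flatten(), k)
    sparse = torch.zeros_like(acc).flatten()
    sparse[idx] = acc.flatten()[idx]
    sparse = sparse.view_as(acc)
    # Remember what was not sent for the next iteration.
    memory = acc - sparse
    return sparse, memory, filtered
```

During the 1-5 warm-up epochs quoted above, compression would simply be skipped and dense gradients exchanged; afterwards each layer would keep its own memory and filtered state across iterations.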