ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training
Authors: Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi (Viji) Srinivasan, Wei Zhang, Kailash Gopalakrishnan
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 4 (Experimental Results): We apply ScaleCom to three major applications: vision (ImageNet, CIFAR10), language (WMT14 En-De), and speech (SWB300). |
| Researcher Affiliation | Industry | Chia-Yu Chen, Jiamin Ni, Songtao Lu, Xiaodong Cui, Pin-Yu Chen, Xiao Sun, Naigang Wang, Swagath Venkataramani, Vijayalakshmi Srinivasan, Wei Zhang, Kailash Gopalakrishnan; IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA ({cchen, cuix, xsun, nwang, viji, weiz, kailash}@us.ibm.com; {jiamin.ni, songtao, pin-yu.chen, swagath.venkataramani}@ibm.com) |
| Pseudocode | Yes | Algorithm 1 ScaleCom: Scalable Sparsified Gradient Compression (a hedged sketch of this style of compression step appears after the table). |
| Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We apply ScaleCom to three major applications: vision (ImageNet, CIFAR10), language (WMT14 En-De), and speech (SWB300). |
| Dataset Splits | No | The paper mentions various datasets (ImageNet, CIFAR10, WMT14 En-De, SWB300) and batch sizes, but does not explicitly provide specific percentages, sample counts, or a detailed methodology for dataset splits (e.g., train/validation/test) that would enable reproduction of the data partitioning. |
| Hardware Specification | Yes | Experiments are run on IBM POWER System AC922 systems using implementations in PyTorch. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not specify its version number or any other software dependencies with their versions. |
| Experiment Setup | Yes | We use 1-5 warm-up epochs (<10% of total training epochs) for compression. A conservative engineering guidance is proposed for compression rate settings in each layer based upon the ratio FLOPs/gradient: 25X for ratio in the range [196, 1]; 50X for [128, 196], and 400X for (0, 128]. ... this guidance is based on the per-worker mini-batch size, 32 for vision and speech and 4.5k for language. ... In these experiments, we adopt hyper-parameter settings from [1][3][5] (including learning rates and momentum)... we set β=1 in the low-pass filter... Once the proposed low-pass filter is applied (β=0.1), ScaleCom achieves almost identical test accuracies. (See the rate-selection sketch after the table.) |
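
The Pseudocode row above points to the paper's Algorithm 1. As a rough illustration of what a sparsified gradient compression step of this kind looks like in PyTorch, the sketch below shows a single worker's top-k selection with error-feedback memory and a low-pass filter. It is a minimal sketch under assumptions, not the authors' Algorithm 1: ScaleCom's cyclic local top-k (CLT-k) index sharing across workers is omitted, the placement of the filter coefficient `beta` is assumed for illustration, and the names `sparsify_step` and `memory` are hypothetical.

```python
# Illustrative sketch (not the authors' exact Algorithm 1): one worker's
# top-k sparsified gradient compression step with error-feedback memory
# and a low-pass filter coefficient. ScaleCom's cyclic local top-k (CLT-k),
# which reuses one worker's index set for all workers in turn, is omitted.
import torch

def sparsify_step(grad, memory, k, beta=0.1):
    """Compress one layer's gradient.

    grad   : dense gradient tensor for this step
    memory : residual (error-feedback) tensor carried from previous steps
    k      : number of elements to keep (compression rate = grad.numel() / k)
    beta   : low-pass filter coefficient (beta = 1 disables filtering);
             its placement here is an assumption for illustration
    """
    # Blend the new gradient into the accumulated memory.
    acc = memory + beta * grad

    # Local top-k selection by magnitude (plain top-k here, not CLT-k).
    flat = acc.flatten()
    _, idx = torch.topk(flat.abs(), k)
    values = flat[idx]

    # Error feedback: whatever was not transmitted stays in memory.
    new_memory = acc.clone()
    new_memory.view(-1)[idx] = 0.0

    return values, idx, new_memory

# Toy usage: keep 40 of 1024 elements, roughly a 25x compression rate.
g = torch.randn(1024)
mem = torch.zeros_like(g)
vals, idx, mem = sparsify_step(g, mem, k=40)
```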
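
The Experiment Setup row quotes a per-layer compression-rate guidance keyed on each layer's FLOPs/gradient ratio. A minimal sketch of how that guidance could be encoded is given below; reading the quoted "[196, 1]" bucket as "ratio of 196 and above" is an assumption, and the function name `compression_rate` is hypothetical.

```python
# Sketch of the quoted per-layer compression-rate guidance, keyed on the
# FLOPs/gradient ratio of each layer. The "[196, 1]" bucket from the quote
# is read here as "ratio of 196 and above" -- that reading is an assumption.
def compression_rate(flops_per_gradient: float) -> int:
    """Return the suggested compression rate (25 means 25x) for a layer."""
    if flops_per_gradient >= 196:
        return 25      # compute-heavy layers: mild compression suffices
    if flops_per_gradient >= 128:
        return 50
    return 400         # communication-heavy layers: aggressive compression

# Example: a layer with a FLOPs/gradient ratio of 150 would use 50x.
assert compression_rate(150) == 50
```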