Distributed Training with Heterogeneous Data: Bridging Median- and Mean-Based Algorithms

Authors: Xiangyi Chen, Tiancong Chen, Haoran Sun, Zhiwei Steven Wu, Mingyi Hong

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we show how adding noise helps the practical behavior of the algorithms. Since SIGNSGD is better studied empirically and MEDIANSGD is more of theoretical interest so far, we use SIGNSGD to demonstrate the benefit of injecting noise. We conduct experiments on MNIST and CIFAR-10 datasets.
Researcher Affiliation | Academia | Xiangyi Chen, University of Minnesota, chen5719@umn.edu; Tiancong Chen, University of Minnesota, chen6271@umn.edu; Haoran Sun, University of Minnesota, sun00111@umn.edu; Zhiwei Steven Wu, Carnegie Mellon University, zstevenwu@cmu.edu; Mingyi Hong, University of Minnesota, mhong@umn.edu
Pseudocode | Yes | Algorithm 1 SIGNSGD (with M nodes), Algorithm 2 MEDIANSGD (with M nodes), Algorithm 3 Noisy SIGNSGD, Algorithm 4 Noisy MEDIANSGD (a sketch of these update rules follows the table).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | Yes | We conduct experiments on MNIST and CIFAR-10 datasets.
Dataset Splits | No | The paper does not provide specific dataset split information (e.g., percentages, sample counts, or citations to predefined splits) for training, validation, and testing.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiments.
Experiment Setup | Yes | We conduct experiments on MNIST and CIFAR-10 datasets. For both datasets, the data distribution on each node is heterogeneous; more specifically, each node contains some exclusive data for one or two out of ten categories. More details about the experiment configuration can be found in Appendix I. For the noisy algorithms we use b = 0.001. The sudden change of performance is caused by learning rate decay, which happens at 1000/3000/5000 iterations. (A sketch of such a heterogeneous partition also follows the table.)
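For orientation, below is a minimal NumPy sketch of the server-side update rules behind the four algorithms named in the Pseudocode row. It assumes each of the M nodes contributes one stochastic gradient per step; the function names are illustrative (not from the paper), and the Gaussian noise of scale b in the noisy variants is an assumption, since the exact noise distribution is specified only in the paper's Algorithms 3-4.

```python
import numpy as np

def signsgd_step(x, node_grads, lr):
    """SIGNSGD with M nodes (majority vote): each node sends sign(g_m);
    the server moves along the sign of the summed votes."""
    votes = np.sum([np.sign(g) for g in node_grads], axis=0)
    return x - lr * np.sign(votes)

def mediansgd_step(x, node_grads, lr):
    """MEDIANSGD with M nodes: the server moves along the
    coordinate-wise median of the node gradients."""
    return x - lr * np.median(np.stack(node_grads), axis=0)

def noisy_signsgd_step(x, node_grads, lr, b=0.001, rng=None):
    """Noisy SIGNSGD: each node perturbs its gradient with noise of scale b
    before taking the sign (Gaussian noise is an assumption here; the paper's
    Algorithm 3 fixes the actual distribution)."""
    rng = np.random.default_rng() if rng is None else rng
    votes = np.sum([np.sign(g + b * rng.standard_normal(g.shape))
                    for g in node_grads], axis=0)
    return x - lr * np.sign(votes)

def noisy_mediansgd_step(x, node_grads, lr, b=0.001, rng=None):
    """Noisy MEDIANSGD: noise of scale b is added before the coordinate-wise
    median is taken (same caveat about the noise distribution as above)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = np.stack([g + b * rng.standard_normal(g.shape) for g in node_grads])
    return x - lr * np.median(noisy, axis=0)
```

The only difference between the mean- and median-based families is the aggregation step at the server; the noisy variants inject per-node noise before that aggregation, which is the mechanism the experiments with b = 0.001 evaluate.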
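The Experiment Setup row states that each node holds exclusive data for one or two of the ten classes. The following is a minimal sketch of such a label-based partition, assuming ten nodes and one class per node; the helper name heterogeneous_split and these defaults are illustrative, since the actual configuration is given only in the paper's Appendix I.

```python
import numpy as np

def heterogeneous_split(labels, num_nodes=10, classes_per_node=1):
    """Assign sample indices to nodes so each node holds data from only one
    (or two) of the ten classes, mimicking the heterogeneous setting above."""
    classes = np.unique(labels)
    node_indices = []
    for m in range(num_nodes):
        # Classes owned by node m; wrap around if num_nodes * classes_per_node
        # exceeds the number of classes.
        owned = [classes[(m * classes_per_node + k) % len(classes)]
                 for k in range(classes_per_node)]
        node_indices.append(np.where(np.isin(labels, owned))[0])
    return node_indices

# Hypothetical usage: split MNIST labels across 10 nodes, one class each.
# labels = mnist_train.targets.numpy()
# parts = heterogeneous_split(labels, num_nodes=10, classes_per_node=1)
```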