Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
Authors: Yunfei Teng, Wenbo Gao, François Chalus, Anna E. Choromanska, Donald Goldfarb, Adrian Weller
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines. |
| Researcher Affiliation | Academia | Yunfei Teng (yt1208@nyu.edu), Wenbo Gao (wg2279@columbia.edu), Francois Chalus (chalusf3@gmail.com), Anna Choromanska (ac5455@nyu.edu), Donald Goldfarb (goldfarb@columbia.edu), Adrian Weller (aw665@cam.ac.uk) |
| Pseudocode | Yes | Algorithm 1 LSGD Algorithm (Asynchronous) (an illustrative sketch of the leader-pulling update appears below the table) |
| Open Source Code | Yes | The code for LSGD can be found at https://github.com/yunfei-teng/LSGD. |
| Open Datasets | Yes | The experiments were performed using the CIFAR-10 data set [33] on three benchmark architectures: the 7-layer CNN used in the original EASGD paper (see Section 5.1 in [1]) that we refer to as CNN7, VGG16 [34], and ResNet20 [35]; and the ImageNet (ILSVRC 2012) data set [36] on ResNet50. |
| Dataset Splits | No | The paper uses standard datasets (CIFAR-10, ImageNet) but does not explicitly provide specific training/validation/test dataset splits, percentages, or absolute sample counts within the main text for reproduction. |
| Hardware Specification | Yes | We use GPU nodes interconnected with Ethernet. Each GPU node has four GTX 1080 GPU processors where each local worker corresponds to one GPU processor. |
| Software Dependencies | Yes | We use CUDA Toolkit 10.0 and NCCL 2.4. |
| Experiment Setup | Yes | During training, we select the leader for the LSGD method based on the average of the training loss computed over the last 10 (CIFAR-10) and 64 (ImageNet) data batches. For all methods we use weight decay with the decay coefficient set to 10⁻⁴. We use a constant learning rate for CNN7 and a learning rate drop (we divide the learning rate by 10 when we observe saturation of the optimizer) for VGG16, ResNet20, and ResNet50. In our experiments we use either 4 workers (single-leader LSGD setting) or 16 workers (multi-leader LSGD setting with 4 groups of workers). We run EASGD and LSGD with communication period τ = 64. We used τG = 128 for the multi-leader LSGD case. (An illustrative sketch of the leader-selection rule appears below the table.) |
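
The asynchronous LSGD pseudocode (Algorithm 1 in the paper) is not reproduced on this page. The following minimal single-process sketch only illustrates the core idea it names: each worker takes an SGD step plus a pull toward the current leader, and the leader is re-elected periodically as the best-performing worker. The toy quadratic objective, the variable names (`tau`, `pull_strength`), and the hyperparameter values are illustrative assumptions, not the authors' distributed implementation.

```python
# Minimal sketch of a leader-pulling SGD step with periodic leader re-election.
# This is NOT the authors' implementation; the toy objective and constants are
# placeholders chosen only to make the example self-contained and runnable.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=10)            # toy problem: minimize ||x - target||^2

def loss(x):
    return float(np.sum((x - target) ** 2))

def grad(x):
    return 2.0 * (x - target)

n_workers, tau = 4, 64                  # 4 local workers, communication period tau
lr, pull_strength = 0.01, 0.1           # step size and leader-pulling coefficient
workers = [rng.normal(size=10) for _ in range(n_workers)]
leader = min(workers, key=loss).copy()  # start with the best initial worker

for step in range(1, 2001):
    for i, x in enumerate(workers):
        # SGD step plus a term pulling the worker's parameters toward the leader
        workers[i] = x - lr * (grad(x) + pull_strength * (x - leader))
    if step % tau == 0:
        # communication round: re-elect the leader as the currently best worker
        leader = min(workers, key=loss).copy()

print("final leader loss:", loss(leader))
```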
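
The leader-selection criterion quoted in the setup row (average training loss over the last 10 batches for CIFAR-10, 64 for ImageNet) can be summarized as a sliding-window average per worker, as sketched below. The `LossWindow` class and `select_leader` helper are assumed names for illustration, not the authors' API.

```python
# Hedged sketch of the leader-selection rule described in the experiment setup:
# track each worker's average training loss over its most recent batches and
# pick the worker with the lowest average. Class and function names are
# illustrative placeholders.
from collections import deque

class LossWindow:
    """Tracks the average training loss over the most recent batches."""
    def __init__(self, window_size=10):
        self.losses = deque(maxlen=window_size)

    def update(self, batch_loss):
        self.losses.append(float(batch_loss))

    def average(self):
        return sum(self.losses) / len(self.losses) if self.losses else float("inf")

def select_leader(worker_windows):
    """Return the index of the worker with the lowest recent average loss."""
    return min(range(len(worker_windows)), key=lambda i: worker_windows[i].average())

# Example: 4 workers (single-leader setting), CIFAR-10-style window of 10 batches
windows = [LossWindow(window_size=10) for _ in range(4)]
windows[0].update(0.52)
windows[1].update(0.47)
windows[2].update(0.35)   # worker 2 currently has the lowest recent loss
windows[3].update(0.61)
print("leader is worker", select_leader(windows))
```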