Leader Stochastic Gradient Descent for Distributed Training of Deep Learning Models
Authors: Yunfei Teng, Wenbo Gao, François Chalus, Anna E. Choromanska, Donald Goldfarb, Adrian Weller
NeurIPS 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines. |
| Researcher Affiliation | Academia | Yunfei Teng (yt1208@nyu.edu), Wenbo Gao (wg2279@columbia.edu), Francois Chalus (chalusf3@gmail.com), Anna Choromanska (ac5455@nyu.edu), Donald Goldfarb (goldfarb@columbia.edu), Adrian Weller (aw665@cam.ac.uk) |
| Pseudocode | Yes | Algorithm 1 LSGD Algorithm (Asynchronous) (an illustrative sketch of the leader-pulling update appears below the table) |
| Open Source Code | Yes | The code for LSGD can be found at https://github.com/yunfei-teng/LSGD. |
| Open Datasets | Yes | The experiments were performed using the CIFAR-10 data set [33] on three benchmark architectures: the 7-layer CNN used in the original EASGD paper (see Section 5.1 in [1]) that we refer to as CNN7, VGG16 [34], and ResNet20 [35]; and the ImageNet (ILSVRC 2012) data set [36] on ResNet50. |
| Dataset Splits | No | The paper uses standard datasets (CIFAR-10, ImageNet) but does not explicitly provide specific training/validation/test dataset splits, percentages, or absolute sample counts within the main text for reproduction. |
| Hardware Specification | Yes | We use GPU nodes interconnected with Ethernet. Each GPU node has four GTX 1080 GPU processors where each local worker corresponds to one GPU processor. |
| Software Dependencies | Yes | We use CUDA Toolkit 10.0 and NCCL 2.4. |
| Experiment Setup | Yes | During training, we select the leader for the LSGD method based on the average of the training loss computed over the last 10 (CIFAR-10) and 64 (ImageNet) data batches. For all methods we use weight decay with the decay coefficient set to 10⁻⁴. We use a constant learning rate for CNN7 and a learning rate drop (we divide the learning rate by 10 when we observe saturation of the optimizer) for VGG16, ResNet20, and ResNet50. In our experiments we use either 4 workers (single-leader LSGD setting) or 16 workers (multi-leader LSGD setting with 4 groups of workers). We run EASGD and LSGD with communication period τ = 64. We used τG = 128 for the multi-leader LSGD case. (An illustrative sketch of the leader-selection rule appears below the table.) |
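
The asynchronous LSGD pseudocode (Algorithm 1 in the paper) is not reproduced on this page. The following minimal single-process sketch only illustrates the core idea it names: each worker takes an SGD step plus a pull toward the current leader, and the leader is re-elected periodically as the best-performing worker. The toy quadratic objective, the variable names (`tau`, `pull_strength`), and the hyperparameter values are illustrative assumptions, not the authors' distributed implementation.

```python
# Minimal sketch of a leader-pulling SGD step with periodic leader re-election.
# This is NOT the authors' implementation; the toy objective and constants are
# placeholders chosen only to make the example self-contained and runnable.
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=10)            # toy problem: minimize ||x - target||^2

def loss(x):
    return float(np.sum((x - target) ** 2))

def grad(x):
    return 2.0 * (x - target)

n_workers, tau = 4, 64                  # 4 local workers, communication period tau
lr, pull_strength = 0.01, 0.1           # step size and leader-pulling coefficient
workers = [rng.normal(size=10) for _ in range(n_workers)]
leader = min(workers, key=loss).copy()  # start with the best initial worker

for step in range(1, 2001):
    for i, x in enumerate(workers):
        # SGD step plus a term pulling the worker's parameters toward the leader
        workers[i] = x - lr * (grad(x) + pull_strength * (x - leader))
    if step % tau == 0:
        # communication round: re-elect the leader as the currently best worker
        leader = min(workers, key=loss).copy()

print("final leader loss:", loss(leader))
```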
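
The leader-selection criterion quoted in the setup row (average training loss over the last 10 batches for CIFAR-10, 64 for ImageNet) can be summarized as a sliding-window average per worker, as sketched below. The `LossWindow` class and `select_leader` helper are assumed names for illustration, not the authors' API.

```python
# Hedged sketch of the leader-selection rule described in the experiment setup:
# track each worker's average training loss over its most recent batches and
# pick the worker with the lowest average. Class and function names are
# illustrative placeholders.
from collections import deque

class LossWindow:
    """Tracks the average training loss over the most recent batches."""
    def __init__(self, window_size=10):
        self.losses = deque(maxlen=window_size)

    def update(self, batch_loss):
        self.losses.append(float(batch_loss))

    def average(self):
        return sum(self.losses) / len(self.losses) if self.losses else float("inf")

def select_leader(worker_windows):
    """Return the index of the worker with the lowest recent average loss."""
    return min(range(len(worker_windows)), key=lambda i: worker_windows[i].average())

# Example: 4 workers (single-leader setting), CIFAR-10-style window of 10 batches
windows = [LossWindow(window_size=10) for _ in range(4)]
windows[0].update(0.52)
windows[1].update(0.47)
windows[2].update(0.35)   # worker 2 currently has the lowest recent loss
windows[3].update(0.61)
print("leader is worker", select_leader(windows))
```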