$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning

Authors: Adel Nabli, Eugene Belilovsky, Edouard Oyallon

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 (Numerical Experiments): "Now, we experimentally compare A²CiD² to a synchronous baseline All-Reduce SGD (AR-SGD, see [26]) and an asynchronous baseline using randomized pairwise communications (a variant of AD-PSGD [28], traditionally used in state-of-the-art decentralized asynchronous training of DNNs)."
Researcher Affiliation | Academia | Adel Nabli (Concordia University, Mila; Sorbonne University, ISIR, CNRS; adel.nabli@sorbonne-universite.fr); Eugene Belilovsky (Concordia University, Mila); Edouard Oyallon (Sorbonne University, ISIR, CNRS)
Pseudocode | Yes | Algorithm 1: "This algorithm block describes our implementation of our asynchronous algorithm with A²CiD² on each local machine." (A sketch of the underlying pairwise-averaging communication pattern is given after the table.)
Open Source Code | Yes | Our code is implemented in PyTorch [35], removes locks put on previous asynchronous implementations by circumventing their deadlocks, and can be found in an open-source repository: https://github.com/AdelNabli/ACiD
Open Datasets | Yes | Following [2], we pick a ResNet18 for CIFAR-10 [24] and a ResNet50 for ImageNet [11].
Dataset Splits | No | The paper mentions using CIFAR-10 and ImageNet and states that for the asynchronous setting, they give 'access to the whole dataset to all workers, each one shuffling it with a different random seed,' rather than splitting it. It does not provide explicit percentages or sample counts for training, validation, or test splits. (A per-worker shuffling sketch is given after the table.)
Hardware Specification | Yes | In particular, we show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers (A100 GPUs) and various communication network topologies.
Software Dependencies | No | The paper mentions 'Pytorch [35]' as the implementation framework but does not specify a version number for PyTorch or any other software dependency.
Experiment Setup | Yes | We fixed the local batch size to 128 on both CIFAR-10 and ImageNet. We use SGD with a base learning rate of 0.1, a momentum of 0.9, and a weight decay of $5 \times 10^{-4}$. As advocated in [16], we do not apply weight decay on the learnable batch-norm coefficients. For ImageNet training with the SGD baseline, we decay the learning rate by a factor of 10 at epochs 30, 60, 80 (epochs 50, 75 for CIFAR-10), and apply an analogous decay schedule with our asynchronous decentralized methods. (A configuration sketch is given after the table.)
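
Algorithm 1 itself is not reproduced on this page, but the communication pattern it builds on is easy to illustrate. Below is a minimal single-process simulation of randomized pairwise (gossip) averaging interleaved with local SGD steps on a toy quadratic objective. The worker count, objective, and scheduling loop are illustrative assumptions, and the A²CiD² continuous momentum correction that the paper adds on top of this pattern is deliberately omitted; this is not the authors' implementation.

```python
import random
import torch

# Toy simulation of decentralized training with randomized pairwise averaging:
# each "worker" owns a local quadratic objective and its own parameter vector.
# Illustrative sketch only; the A^2CiD^2 momentum term of Algorithm 1 is omitted.
n_workers, n_steps, lr, dim = 8, 200, 0.05, 10
targets = [torch.randn(dim) for _ in range(n_workers)]  # per-worker "data"
params = [torch.zeros(dim, requires_grad=True) for _ in range(n_workers)]

for step in range(n_steps):
    # 1) every worker takes one local SGD step on its own objective
    for i in range(n_workers):
        loss = 0.5 * (params[i] - targets[i]).pow(2).sum()
        loss.backward()
        with torch.no_grad():
            params[i] -= lr * params[i].grad
        params[i].grad = None
    # 2) one randomized pairwise communication: two workers average parameters
    i, j = random.sample(range(n_workers), 2)
    with torch.no_grad():
        avg = 0.5 * (params[i] + params[j])
        params[i].copy_(avg)
        params[j].copy_(avg)

# Workers drift toward consensus around the average of their local optima.
spread = torch.stack([p.detach() for p in params]).std(dim=0).mean().item()
print(f"parameter spread across workers after training: {spread:.4f}")
```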
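
For the dataset-splits row, the quoted passage says every asynchronous worker sees the whole training set, shuffled with a worker-specific random seed, rather than a disjoint shard. A minimal PyTorch sketch of that pattern follows; the `worker_rank` placeholder, the CIFAR-10 choice, and the transform are assumptions used only for illustration.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Each worker loads the *full* training set and shuffles it with its own seed.
# `worker_rank` is a hypothetical placeholder for the rank assigned by the
# distributed launcher.
worker_rank = 3

train_set = datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transforms.ToTensor()
)
generator = torch.Generator()
generator.manual_seed(worker_rank)  # a different shuffle order per worker
train_loader = DataLoader(
    train_set, batch_size=128, shuffle=True, generator=generator, num_workers=4
)
```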
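
The experiment-setup row pins down the optimizer hyper-parameters precisely (local batch size 128, SGD with learning rate 0.1, momentum 0.9, weight decay $5 \times 10^{-4}$, no weight decay on batch-norm coefficients, learning-rate decay by 10 at epochs 30/60/80 on ImageNet). A sketch of how that could look in PyTorch is below; the parameter-group split and the use of torchvision's stock ResNet50 are our assumptions, not code from the paper's repository.

```python
import torch
from torch.nn.modules.batchnorm import _BatchNorm
from torchvision.models import resnet50

# ResNet50 for ImageNet, per the reported setup (ResNet18 is used for CIFAR-10).
model = resnet50(num_classes=1000)

# Exclude learnable batch-norm coefficients from weight decay, as stated in the
# quoted setup; the grouping logic itself is an illustrative assumption.
bn_params, other_params = [], []
for module in model.modules():
    for p in module.parameters(recurse=False):
        (bn_params if isinstance(module, _BatchNorm) else other_params).append(p)

optimizer = torch.optim.SGD(
    [
        {"params": other_params, "weight_decay": 5e-4},
        {"params": bn_params, "weight_decay": 0.0},
    ],
    lr=0.1,
    momentum=0.9,
)

# Decay the learning rate by a factor of 10 at epochs 30, 60 and 80 (ImageNet);
# the scheduler is stepped once per epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 80], gamma=0.1
)
```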