$\textbf{A}^2\textbf{CiD}^2$: Accelerating Asynchronous Communication in Decentralized Deep Learning
Authors: Adel Nabli, Eugene Belilovsky, Edouard Oyallon
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Now, we experimentally compare $\text{A}^2\text{CiD}^2$ to a synchronous baseline All-Reduce SGD (AR-SGD, see [26]) and an asynchronous baseline using randomized pairwise communications (a variant of AD-PSGD [28], traditionally used in state-of-the-art decentralized asynchronous training of DNNs). (Section 4, Numerical Experiments; see the communication-scheme sketch below the table.) |
| Researcher Affiliation | Academia | Adel Nabli (Concordia University, Mila; Sorbonne University, ISIR, CNRS; adel.nabli@sorbonne-universite.fr); Eugene Belilovsky (Concordia University, Mila); Edouard Oyallon (Sorbonne University, ISIR, CNRS) |
| Pseudocode | Yes | Algorithm 1: This algorithm block describes our implementation of our asynchronous algorithm with $\text{A}^2\text{CiD}^2$ on each local machine. (See the worker-loop sketch below the table.) |
| Open Source Code | Yes | Our code is implemented in Pytorch [35], removes locks put on previous asynchronous implementations by circumventing their deadlocks, and can be found in an open-source repository: https://github.com/AdelNabli/ACiD. |
| Open Datasets | Yes | Following [2], we pick a ResNet18 for CIFAR-10 [24] and ResNet50 for ImageNet [11]. |
| Dataset Splits | No | The paper mentions using CIFAR-10 and ImageNet and states that for the asynchronous setting, they give 'access to the whole dataset to all workers, each one shuffling it with a different random seed,' rather than splitting it. It does not provide explicit percentages or sample counts for training, validation, or test splits. (See the data-loading sketch below the table.) |
| Hardware Specification | Yes | In particular, we show consistent improvement on the ImageNet dataset using up to 64 asynchronous workers (A100 GPUs) and various communication network topologies. |
| Software Dependencies | No | The paper mentions 'Pytorch [35]' as the implementation framework but does not specify a version number for Pytorch or any other software dependency. |
| Experiment Setup | Yes | We fixed the local batch size to 128 on both CIFAR-10 and ImageNet. We use SGD with a base learning rate of 0.1, a momentum value set at 0.9, and $5 \times 10^{-4}$ for weight decay. As advocated in [16], we do not apply weight decay on the learnable batch-norm coefficients. For ImageNet training with the SGD baseline, we decay the learning rate by a factor of 10 at epochs 30, 60, 80 (epochs 50, 75 for CIFAR-10), and apply an analogous decay schedule with our asynchronous decentralized methods. (See the optimizer/scheduler sketch below the table.) |
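
For the "Research Type" row: the two communication schemes being compared can be contrasted on plain tensors. Synchronous All-Reduce SGD averages all workers at once, while the asynchronous baseline (and $\text{A}^2\text{CiD}^2$) relies on randomized pairwise exchanges. The sketch below is a minimal single-process simulation of both primitives; it is illustrative only, does not use a distributed backend, and does not reproduce the paper's implementation.

```python
import torch

torch.manual_seed(0)
n_workers = 8
# One parameter vector per simulated worker (stand-in for model weights).
params = [torch.randn(4) for _ in range(n_workers)]

def all_reduce_average(params):
    """Synchronous All-Reduce: every worker ends up with the global mean."""
    mean = torch.stack(params).mean(dim=0)
    return [mean.clone() for _ in params]

def pairwise_average_step(params):
    """One randomized pairwise (gossip) exchange: two workers average their parameters."""
    i, j = torch.randperm(len(params))[:2].tolist()
    avg = 0.5 * (params[i] + params[j])
    params[i], params[j] = avg.clone(), avg.clone()
    return params

params_sync = all_reduce_average(params)
params_async = pairwise_average_step([p.clone() for p in params])
print("disagreement after all-reduce:", torch.stack(params_sync).std(dim=0).max().item())
print("disagreement after one gossip step:", torch.stack(params_async).std(dim=0).max().item())
```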
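For the "Pseudocode" row: Algorithm 1 runs on each local machine. The sketch below only shows the general skeleton such a per-worker loop follows (local SGD steps interleaved with asynchronous pairwise parameter averaging), simulated in a single process on a toy objective. The $\text{A}^2\text{CiD}^2$ continuous momentum correction itself is omitted, and all names here are hypothetical; this is not the authors' Algorithm 1.

```python
import torch

torch.manual_seed(0)
n_workers, n_steps, lr = 4, 200, 0.05
# Toy objective per worker: least squares on worker-specific data (stand-in for local minibatches).
targets = [torch.randn(4) for _ in range(n_workers)]
params = [torch.zeros(4, requires_grad=True) for _ in range(n_workers)]

for step in range(n_steps):
    # 1) Each worker takes a local gradient step on its own data.
    for w in range(n_workers):
        loss = 0.5 * (params[w] - targets[w]).pow(2).sum()
        loss.backward()
        with torch.no_grad():
            params[w] -= lr * params[w].grad
            params[w].grad.zero_()
    # 2) Asynchronous-style communication: one random pair averages its parameters.
    with torch.no_grad():
        i, j = torch.randperm(n_workers)[:2].tolist()
        avg = 0.5 * (params[i] + params[j])
        params[i].copy_(avg)
        params[j].copy_(avg)

disagreement = torch.stack([p.detach() for p in params]).std(dim=0).max()
print(f"max parameter disagreement across workers: {disagreement:.4f}")
```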
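For the "Dataset Splits" row: rather than partitioning the training set, each asynchronous worker iterates over the whole dataset shuffled with its own random seed. One minimal way to express that in PyTorch is a per-worker `torch.Generator` for the `DataLoader`, as sketched below with a dummy dataset standing in for CIFAR-10/ImageNet; `make_worker_loader`, `worker_rank`, and `base_seed` are hypothetical names, not from the paper.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for CIFAR-10 / ImageNet.
dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))

def make_worker_loader(worker_rank: int, base_seed: int = 42) -> DataLoader:
    """Every worker sees the *full* dataset, shuffled with its own seed."""
    generator = torch.Generator()
    generator.manual_seed(base_seed + worker_rank)
    return DataLoader(dataset, batch_size=128, shuffle=True, generator=generator)

loaders = [make_worker_loader(rank) for rank in range(4)]
first_labels = [next(iter(loader))[1][:5].tolist() for loader in loaders]
print(first_labels)  # different orderings of the same underlying data
```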
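For the "Experiment Setup" row: a plausible PyTorch translation of the quoted recipe (SGD, learning rate 0.1, momentum 0.9, weight decay 5e-4, no weight decay on batch-norm coefficients, learning-rate decay by 10 at epochs 30/60/80 for ImageNet, 50/75 for CIFAR-10) is sketched below for the ResNet18/CIFAR-10 case. This is an assumption about how the setup maps onto standard torch/torchvision APIs, not the authors' training script; the 90-epoch budget is also an assumption not stated in the quoted text.

```python
import torch
from torch import nn
from torchvision.models import resnet18

model = resnet18(num_classes=10)  # ResNet18 for CIFAR-10, per the paper

# Split parameters so that learnable batch-norm coefficients receive no weight decay.
bn_params, other_params = [], []
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        bn_params += list(module.parameters(recurse=False))
    else:
        other_params += list(module.parameters(recurse=False))

optimizer = torch.optim.SGD(
    [
        {"params": other_params, "weight_decay": 5e-4},
        {"params": bn_params, "weight_decay": 0.0},
    ],
    lr=0.1,
    momentum=0.9,
)

# Decay the learning rate by 10x at epochs 30, 60, 80 (use [50, 75] for CIFAR-10).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 60, 80], gamma=0.1)

for epoch in range(90):
    # ... one training epoch with local batch size 128 would go here ...
    scheduler.step()
```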