Distributed Deep Learning In Open Collaborations
Authors: Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitry Popov, Dmitry V. Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, Gennady Pekhimenko
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. |
| Researcher Affiliation | Collaboration | Yandex, Russia; Hugging Face, USA; HSE University, Russia; Moscow Institute of Physics and Technology, Russia; University of Toronto, Canada; Vector Institute, Canada |
| Pseudocode | No | While the paper describes algorithmic details and problem formulations (e.g., as a linear program), it does not include a formally labeled 'Pseudocode' or 'Algorithm' block with structured steps. |
| Open Source Code | Yes | Code and training configurations are available at github.com/yandex-research/DeDLOC |
| Open Datasets | Yes | We train the ResNet-50 [93] model on the ImageNet dataset [1] without labels; pretrain the ALBERT-large [7] masked language model on the WikiText-103 dataset [97]; and train the ALBERT-large model on Wikipedia and the Bengali part of the OSCAR [100] multilingual corpus. (A hedged data-loading sketch appears after the table.) |
| Dataset Splits | No | The paper mentions dataset usage such as the WikiText-103 and ImageNet datasets, but does not specify explicit train/validation/test splits (e.g., percentages or exact counts) needed for reproducibility. |
| Hardware Specification | Yes | We train with three hardware setups: SERVER, WORKSTATION and HYBRID. The SERVER setup contains 8 workers, each with a single V100 GPU and 1 Gb/s symmetric bandwidth. In turn, the WORKSTATION setup consists of 16 nodes with 1080 Ti and 200 Mb/s bandwidth per worker. and We run all experiments on cloud instances with Tesla T4 GPUs. |
| Software Dependencies | No | The paper mentions using 'Hivemind [95]' and 'the transformers library [99]' but does not provide specific version numbers for these software dependencies, which is necessary for reproducibility. |
| Experiment Setup | Yes | Our experiments follow the recommended training configuration [92, 94]: 2+6 random crops, early prototype freezing and a queue with 3,840 samples for each worker, LARS [78] optimizer, and 32,768 samples per batch across all workers. (See the configuration sketch below.) |
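The SwAV settings quoted in the Experiment Setup row can be collected into a single configuration object, shown below as a minimal sketch. The field names are hypothetical (they are not taken from the released DeDLOC configs), and the exact number of iterations with frozen prototypes is not stated in the quoted excerpt; only the crop counts, per-worker queue length, optimizer choice, and global batch size come from the paper.

```python
from dataclasses import dataclass

@dataclass
class SwAVPretrainingConfig:
    """Illustrative summary of the quoted SwAV settings; field names are hypothetical."""
    num_global_crops: int = 2            # "2+6 random crops": two global views per image...
    num_local_crops: int = 6             # ...plus six smaller local views
    freeze_prototypes_early: bool = True # prototypes are frozen early in training (duration not quoted)
    queue_length_per_worker: int = 3840  # feature queue maintained by each worker
    optimizer: str = "LARS"              # layer-wise adaptive rate scaling optimizer
    global_batch_size: int = 32768       # samples per step summed across all workers

print(SwAVPretrainingConfig())
```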
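For the Open Datasets row, the text corpora are publicly available through the Hugging Face `datasets` library. The calls below are a minimal loading sketch, not the paper's own data pipeline: the Hub configuration names are assumptions based on the public naming scheme, and ImageNet is omitted because it is distributed separately under its own terms.

```python
from datasets import load_dataset

# WikiText-103, used for ALBERT-large masked language model pretraining.
wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")

# Bengali portion of the multilingual OSCAR corpus, used in the collaborative run;
# the configuration name is an assumption based on the public Hub naming scheme.
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(wikitext)
print(oscar_bn)
```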