Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts

Authors: Max Ryabinin, Anton Gusev

Venue: NeurIPS 2020

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | “In this section we run several benchmarks in order to verify these assumptions. We intentionally focus on small-scale experiments to make them easier to reproduce and analyze. While solving practical vision and NLP problems is certainly our end goal, choosing a particular task would make it much harder to understand the general properties of our approach.” |
| Researcher Affiliation | Collaboration | Max Ryabinin: Yandex and National Research University Higher School of Economics (mryabinin@hse.ru); Anton Gusev: Independent (uartman@mail.ru) |
| Pseudocode | No | The paper describes procedures and uses diagrams (e.g., Figure 2) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | “The PyTorch source code that can be used to reproduce our results is available online.” (https://github.com/mryab/learning-at-home) |
| Open Datasets | Yes | “For this goal, we choose one of the simpler tasks in deep learning, namely the MNIST digit recognition dataset [63], and compare convergence rates under varying network latency. Specifically, we train Transformer-XL [64] on the WikiText-2 [65] dataset.” A loading sketch follows the table. |
| Dataset Splits | No | The paper reports “Validation accuracy” and “Validation perplexity” in its figures (Figure 5, Figure 6), implying the use of a validation set, but does not explicitly provide the split percentages or the methodology for creating the validation split. |
| Hardware Specification | Yes | “We create a model from a large number of identical blocks distributed evenly across 4 NVIDIA GTX 1080 GPUs.” “In particular, we rented 3 instances with Tesla K80 hosted in West US, East US, and West Europe.” |
| Software Dependencies | No | The paper mentions “PyTorch” in relation to its source code but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | “For Learning@home, we use 64 trainer processes to send requests to the runtime processes. In the high-latency scenario, each of 64 workers is delayed for 1 second on average while processing a batch. This corresponds to 125ms for each forward and backward pass through DMoE. For low-latency emulation, we use 16 workers and 100ms average delay. The third experiment simulates node failure: each expert does not respond to a request with probability 0.1. Our DMoE Transformer uses 256 experts split evenly between 16 layers. Each expert is a Transformer layer with the same dimensions as layers of the small baseline model. The DMoE layers route to top-4 experts, making our model roughly equivalent to base in terms of FLOPs per sample. Similarly to Section 4.2, we train DMoE with 32 trainers (batch size 1 each), 1000ms average latency, and 10% failure rate.” Sketches after the table illustrate the latency/failure emulation and the top-4 routing. |
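
Both datasets named in the Open Datasets row are publicly downloadable, which supports the “Yes” verdict. As a minimal sketch of obtaining them (the paper does not state how the authors fetched the data; torchvision and torchtext are assumed tooling here, and the torchtext API differs between releases):

```python
# Hypothetical loading code; torchvision and torchtext are assumptions,
# not confirmed by the paper, which only names MNIST and WikiText-2.
from torchvision import datasets, transforms

mnist_train = datasets.MNIST(root="data", train=True, download=True,
                             transform=transforms.ToTensor())

from torchtext.datasets import WikiText2  # API varies across torchtext versions

wikitext_train = WikiText2(split="train")  # raw text iterator for language modeling
```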
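
The latency and failure emulation quoted in the Experiment Setup row can be made concrete. The following is a minimal sketch, assuming an exponential delay distribution and a synchronous caller; the function name call_remote_expert and all constants are illustrative, and the released repository may implement this differently (Learning@home issues such requests asynchronously, which a `time.sleep` loop does not capture):

```python
import random
import time
from typing import Optional

import torch
import torch.nn as nn

MEAN_DELAY_S = 1.0   # 1000 ms average latency, as in the high-latency scenario
FAILURE_PROB = 0.1   # each expert ignores a request with probability 0.1


def call_remote_expert(expert: nn.Module, x: torch.Tensor) -> Optional[torch.Tensor]:
    """Emulate one request to a remote expert: random delay, possible failure."""
    if random.random() < FAILURE_PROB:
        return None  # the expert never responds; the caller must skip it
    time.sleep(random.expovariate(1.0 / MEAN_DELAY_S))  # emulated network delay
    return expert(x)


if __name__ == "__main__":
    experts = [nn.Linear(8, 8) for _ in range(4)]  # stand-ins for the top-4 experts
    x = torch.randn(2, 8)
    replies = [y for e in experts if (y := call_remote_expert(e, x)) is not None]
    # Tolerate non-responding experts instead of waiting forever; here we
    # simply average whatever came back and fall back to the input otherwise.
    output = torch.stack(replies).mean(0) if replies else x
```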
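
Since the paper contains no pseudocode (see the Pseudocode row), the routing configuration is easiest to read alongside a generic top-k gating sketch. With 256 experts split evenly between 16 layers, each DMoE layer selects the top-4 of its 16 experts. Everything below is a common dense emulation under assumed dimensions, with simple feed-forward experts standing in for the Transformer-layer experts the paper describes; it is not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    """Dense emulation of one DMoE layer: 16 experts, top-4 routing (assumed sizes)."""

    def __init__(self, dim: int = 256, num_experts: int = 16, k: int = 4):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(dim, num_experts)  # routing scores per expert
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.gate(x)                           # [batch, num_experts]
        top_val, top_idx = scores.topk(self.k, dim=-1)  # keep only the top-4
        weights = F.softmax(top_val, dim=-1)            # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for i, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == i            # samples routed to expert i
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```

Only 4 of the 16 experts run per sample, which is what keeps per-sample compute bounded and underlies the quoted FLOPs-per-sample comparison with the baseline.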