Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts
Authors: Max Ryabinin, Anton Gusev
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we run several benchmarks in order to verify these assumptions. We intentionally focus on small-scale experiments to make them easier to reproduce and analyze. While solving practical vision and NLP problems is certainly our end goal, choosing a particular task would make it much harder to understand the general properties of our approach. |
| Researcher Affiliation | Collaboration | Max Ryabinin (Yandex; National Research University Higher School of Economics), mryabinin@hse.ru; Anton Gusev (Independent), uartman@mail.ru |
| Pseudocode | No | The paper describes procedures and uses diagrams (e.g., Figure 2) but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The PyTorch source code that can be used to reproduce our results is available online: https://github.com/mryab/learning-at-home |
| Open Datasets | Yes | For this goal, we choose one of the simpler tasks in deep learning, namely the MNIST digit recognition dataset [63], and compare convergence rates under varying network latency. Specifically, we train Transformer-XL [64] on the WikiText-2 [65] dataset. A loading sketch for both datasets follows the table. |
| Dataset Splits | No | The paper uses “Validation accuracy” and “Validation perplexity” in its figures (Figure 5, Figure 6), implying the use of a validation set, but does not explicitly provide the split percentages or methodology for creating the validation split. |
| Hardware Specification | Yes | We create a model from a large number of identical blocks distributed evenly across 4 NVIDIA GTX 1080 GPUs. In particular, we rented 3 instances with Tesla K80 hosted in West US, East US, and West Europe. |
| Software Dependencies | No | The paper mentions "PyTorch" in relation to its source code but does not provide specific version numbers for PyTorch or any other software dependencies. |
| Experiment Setup | Yes | For Learning@home, we use 64 trainer processes to send requests to the runtime processes. In the high-latency scenario, each of 64 workers is delayed for 1 second on average while processing a batch. This corresponds to 125ms for each forward and backward pass through DMoE. For low latency emulation, we use 16 workers and 100ms average delay. The third experiment simulates node failure: each expert does not respond to a request with probability 0.1. Our DMoE Transformer uses 256 experts split evenly between 16 layers. Each expert is a Transformer layer with the same dimensions as layers of the small baseline model. The DMoE layers route to top-4 experts, making our model roughly equivalent to base in terms of FLOPs per sample. Similarly to Section 4.2, we train DMoE with 32 trainers (batch size 1 each), 1000ms average latency, and 10% failure rate. Hedged sketches of the top-4 routing and of the latency/failure emulation follow the table. |
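
Both datasets quoted in the Open Datasets row are publicly available. Below is a minimal loading sketch, assuming standard `torchvision` and `torchtext` distributions; the `root` paths, transforms, and the exact `WikiText2` call are version-dependent placeholders, not the authors' pipeline.

```python
# Minimal sketch: fetching the two public datasets named in the table.
# Paths, transforms, and the torchtext API are assumptions, not the paper's code.
import torchvision
from torchvision import transforms

# MNIST digit recognition, used for the convergence-under-latency benchmark.
mnist_train = torchvision.datasets.MNIST(
    root="./data", train=True, download=True, transform=transforms.ToTensor()
)

# WikiText-2 for the Transformer-XL language-modelling experiment
# (requires a torchtext version that still ships datasets.WikiText2).
from torchtext.datasets import WikiText2

wikitext_train = WikiText2(root="./data", split="train")
```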
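
The Experiment Setup row describes a DMoE Transformer with 256 experts split evenly between 16 layers (16 experts per layer) and top-4 routing. The sketch below shows one way such a layer could be emulated densely on a single machine; the class name, gating network, and dimensions are illustrative assumptions, not the authors' Learning@home implementation, which dispatches these calls to remote experts discovered through a DHT.

```python
# Hedged sketch of one mixture-of-experts layer with top-k routing, emulated densely.
# Dimensions and the gating scheme are placeholders, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_experts=16, k=4):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # scores every expert per token
        self.experts = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead) for _ in range(num_experts)
        )

    def forward(self, x):  # x: (seq_len, batch, d_model)
        scores = self.gate(x)                                # (seq, batch, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # keep only top-k experts
        weights = F.softmax(topk_scores, dim=-1)             # mixing weights over top-k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            selected = topk_idx == e                         # (seq, batch, k) bool mask
            if selected.any():
                # Weight of expert e per token (zero for tokens that did not pick it).
                w_e = (weights * selected).sum(dim=-1, keepdim=True)
                out = out + w_e * expert(x)
        return out
```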
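
The same row quotes the latency emulation (workers delayed by roughly 1 s or 100 ms on average per batch) and the failure simulation (each expert drops a request with probability 0.1). The following is a hedged sketch of how such conditions might be emulated around expert calls; `call_expert`, `mixture_forward`, and the exponential delay model are illustrative assumptions rather than the Learning@home runtime API.

```python
# Hedged sketch: emulating network latency and unresponsive experts around each
# expert call, with the caller averaging over whichever experts responded.
import random
import time


def call_expert(expert, batch, mean_delay=1.0, failure_prob=0.1):
    """Emulate one request to a remote expert: random delay, possible no-response."""
    time.sleep(random.expovariate(1.0 / mean_delay))  # emulated network/queueing delay
    if random.random() < failure_prob:
        return None                                   # expert failed to respond
    return expert(batch)


def mixture_forward(experts, weights, batch):
    """Combine outputs of the experts that responded, renormalizing their weights."""
    outputs, kept = [], []
    for expert, w in zip(experts, weights):
        result = call_expert(expert, batch)
        if result is not None:
            outputs.append(result)
            kept.append(w)
    if not outputs:
        return None                                   # every selected expert failed
    total = sum(kept)
    return sum((w / total) * out for w, out in zip(kept, outputs))
```

Skipping unresponsive experts and renormalizing over the survivors is one simple way to keep training running under the quoted 10% failure rate.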