DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Authors: Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-trained both the dense and MoE models on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances), using the same training data as described in (Microsoft & Nvidia, 2021). We use 300B tokens to train both dense and MoE models. In addition to the pre-training validation loss, we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013). ... Figure 8 shows that both DeepSpeed-MoE and PyTorch reduce the inference latency as we increase the number of GPUs, as expected, although PyTorch is much slower compared to DeepSpeed-MoE. |
| Researcher Affiliation | Industry | Microsoft. Correspondence to: Samyam Rajbhandari <samyamr@microsoft.com>, Yuxiong He <yuxhe@microsoft.com>. |
| Pseudocode | No | The paper describes its methods in prose and with architectural diagrams but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The generic DeepSpeed-MoE end-to-end framework for training and inference of MoE-based models is open-sourced as part of the DeepSpeed software... Please find the code, tutorials, and documents at DeepSpeed GitHub (https://github.com/microsoft/DeepSpeed) and website (https://www.deepspeed.ai/). (See the usage sketch after this table.) |
| Open Datasets | Yes | We pre-trained both the dense and MoE models... using the same training data as described in (Microsoft & Nvidia, 2021). ...we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013). |
| Dataset Splits | Yes | In addition to the pre-training validation loss, we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013). ... Figure 1 shows that the validation loss of the MoE models is significantly better than their dense counterparts |
| Hardware Specification | Yes | We pre-trained both the dense and MoE models on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances) |
| Software Dependencies | No | The paper mentions software like the 'DeepSpeed software' and a 'PyTorch implementation' (e.g., 'PyTorch-MoE' and 'DeepSpeed-MoE' in Figure 7), but it does not specify any version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | Appendix B, "MoE-based NLG Model Training and Evaluation Settings": "Table 3 summarizes the hyperparameters for training the dense and MoE models." (Table 3 details Num. layers, Hidden size, Num. attention heads, Num. experts per layer, Num. parameters, Context/sequence length, Training tokens, Batch size, Learning rate, etc.) |
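
Since the open-source evidence points at the general DeepSpeed release rather than a paper-specific artifact, the following is a minimal sketch of how a MoE layer can be instantiated with that library, for readers attempting replication. It assumes the `deepspeed.moe.layer.MoE` API described in the DeepSpeed tutorials (an expert module plus `num_experts`, `ep_size`, and top-k gating arguments, with the forward pass returning the output, the auxiliary load-balancing loss, and per-expert token counts) and a distributed job started with the `deepspeed` launcher. All sizes are illustrative placeholders, not the hyperparameters of the paper's Table 3.

```python
# Minimal sketch of wiring a MoE layer with the open-source DeepSpeed library.
# Assumes a distributed run started by the `deepspeed` launcher (RANK, WORLD_SIZE,
# MASTER_ADDR set in the environment). Sizes are illustrative placeholders only.
import torch
import torch.nn as nn
import deepspeed
from deepspeed.moe.layer import MoE

deepspeed.init_distributed()  # initializes torch.distributed from launcher env vars

hidden_size = 1024  # placeholder model width

# The "expert" is an ordinary transformer feed-forward block; the MoE layer
# replicates it num_experts times and routes tokens between the copies.
expert = nn.Sequential(
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

moe_layer = MoE(
    hidden_size=hidden_size,
    expert=expert,
    num_experts=8,  # placeholder; the paper uses far more experts per layer
    ep_size=2,      # expert-parallel group size; must divide the world size
    k=1,            # top-1 gating
)

x = torch.randn(4, 16, hidden_size)              # (batch, sequence, hidden)
output, aux_loss, expert_counts = moe_layer(x)   # aux_loss is the load-balancing term
```

In training, the returned auxiliary loss would be scaled and added to the language-modeling loss so that the gate spreads tokens across experts; the exact scaling and the real layer/expert counts should be taken from the paper's Table 3 and the DeepSpeed-MoE tutorials rather than this sketch.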