DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Authors: Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We pre-trained both the dense and MoE models on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances), using the same training data as described in (Microsoft & Nvidia, 2021). We use 300B tokens to train both dense and MoE models. In addition to the pre-training validation loss, we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013)." "Figure 8 shows that both DeepSpeed-MoE and PyTorch reduce the inference latency as we increase the number of GPUs, as expected, although PyTorch is much slower compared to DeepSpeed-MoE." (A generic latency-timing sketch follows the table.)
Researcher Affiliation | Industry | "Microsoft. Correspondence to: Samyam Rajbhandari <samyamr@microsoft.com>, Yuxiong He <yuxhe@microsoft.com>."
Pseudocode | No | The paper describes its methods in prose and with architectural diagrams but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | "The generic DeepSpeed-MoE end-to-end framework for training and inference of MoE-based models is open-sourced as part of the DeepSpeed software... Please find the code, tutorials, and documents at DeepSpeed GitHub (https://github.com/microsoft/DeepSpeed) and website (https://www.deepspeed.ai/)." (A minimal usage sketch of the released MoE layer follows the table.)
Open Datasets | Yes | "We pre-trained both the dense and MoE models... using the same training data as described in (Microsoft & Nvidia, 2021). ...we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013)."
Dataset Splits | Yes | "In addition to the pre-training validation loss, we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013)." "Figure 1 shows that the validation loss of the MoE models is significantly better than their dense counterparts." (A generic zero-shot evaluation sketch follows the table.)
Hardware Specification | Yes | "We pre-trained both the dense and MoE models on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances)."
Software Dependencies | No | The paper mentions software such as the DeepSpeed library and a baseline PyTorch MoE implementation (e.g., 'PyTorch-MoE' vs. 'DeepSpeed-MoE' in Figure 7), but it does not specify version numbers for these or any other software dependencies required for replication.
Experiment Setup | Yes | "Appendix B. MoE-based NLG Model Training and Evaluation Settings. Table 3 summarizes the hyperparameters for training the dense and MoE models." (Table 3 details Num. layers, Hidden size, Num. attention heads, Num. experts per layer, Num. parameters, Context/sequence length, Training tokens, Batch size, Learning rate, etc.) (A placeholder config sketch mirroring these fields follows the table.)
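
The Research Type row above quotes the Figure 8 inference-latency comparison between DeepSpeed-MoE and a PyTorch baseline. The snippet below is only a generic PyTorch timing sketch, not the paper's benchmarking harness; `model` and `inputs` are placeholders for whichever MoE model and batch are being measured.

```python
import time
import torch

def measure_latency_ms(model, inputs, warmup=5, iters=20):
    """Average GPU inference latency in milliseconds (generic timing sketch)."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):       # warm-up passes exclude one-time setup costs
            model(inputs)
        torch.cuda.synchronize()      # wait for queued kernels before starting the clock
        start = time.perf_counter()
        for _ in range(iters):
            model(inputs)
        torch.cuda.synchronize()      # wait again so the stop time reflects finished work
    return (time.perf_counter() - start) / iters * 1000.0
```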
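
For the Open Source Code row: the released DeepSpeed library documents a drop-in `MoE` layer. The sketch below assumes that public tutorial API (`deepspeed.moe.layer.MoE`) and uses illustrative sizes and expert counts rather than the paper's configuration; it is expected to run inside a process started with the `deepspeed` launcher (or with `torch.distributed` already initialized).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from deepspeed.moe.layer import MoE   # MoE layer shipped with the open-sourced DeepSpeed

class ExpertMLP(nn.Module):
    """Illustrative expert: the usual transformer feed-forward block."""
    def __init__(self, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(hidden_size, 4 * hidden_size)
        self.fc2 = nn.Linear(4 * hidden_size, hidden_size)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

hidden = 1024                          # placeholder hidden size
moe_ffn = MoE(
    hidden_size=hidden,
    expert=ExpertMLP(hidden),
    num_experts=8,                     # placeholder, not the paper's expert count
    k=1,                               # top-1 gating
)

x = torch.randn(4, 128, hidden)        # (batch, sequence, hidden)
out, aux_loss, _ = moe_ffn(x)          # aux_loss is the gating load-balancing loss
```

In a full transformer, a layer like this would replace the dense feed-forward sublayer on the MoE layers, with `aux_loss` added to the language-modeling loss.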
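
For the Dataset Splits row, which lists the six zero-shot tasks: the paper does not describe its evaluation code, so the snippet below is only a generic sketch of LAMBADA-style last-word accuracy using Hugging Face `transformers` with a stand-in `gpt2` checkpoint; it is not the paper's evaluation pipeline, and the paper's checkpoints are not assumed to be available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model; the paper's dense/MoE NLG checkpoints are not assumed here.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def lambada_last_word_correct(passage: str) -> bool:
    """Greedy last-word prediction, the usual LAMBADA zero-shot protocol."""
    context, target = passage.rsplit(" ", 1)
    ids = tok(context, return_tensors="pt").input_ids
    n_target = tok(" " + target, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        gen = model.generate(ids, max_new_tokens=n_target, do_sample=False)
    return tok.decode(gen[0, ids.shape[1]:]).strip() == target
```

Task accuracy is then the fraction of held-out passages for which the predicted final word matches the reference.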
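
For the Experiment Setup row: Table 3 (Appendix B) holds the actual hyperparameters. The dictionary below only mirrors the fields that row lists; every value except the 300B training tokens quoted above is left as a placeholder rather than guessed.

```python
# Placeholders mirroring the fields reported in the paper's Table 3 (Appendix B).
# Only train_tokens comes from the quoted text; fill the rest from the appendix.
nlg_moe_hparams = {
    "num_layers": None,               # Num. layers
    "hidden_size": None,              # Hidden size
    "num_attention_heads": None,      # Num. attention heads
    "num_experts_per_layer": None,    # Num. experts per layer
    "num_parameters": None,           # Num. parameters
    "seq_length": None,               # Context/sequence length
    "train_tokens": 300_000_000_000,  # 300B training tokens (quoted in the table above)
    "batch_size": None,               # Batch size
    "learning_rate": None,            # Learning rate
}
```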