Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Authors: Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He
ICML 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-trained both the dense and MoE models on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances), using the same training data as described in (Microsoft & Nvidia, 2021). We use 300B tokens to train both dense and MoE models. In addition to the pre-training validation loss, we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013). ... Figure 8 shows that both DeepSpeed-MoE and PyTorch reduce the inference latency as we increase the number of GPUs, as expected, although PyTorch is much slower compared to DeepSpeed-MoE. |
| Researcher Affiliation | Industry | ¹Microsoft. Correspondence to: Samyam Rajbhandari <EMAIL>, Yuxiong He <EMAIL>. |
| Pseudocode | No | The paper describes its methods in prose and with architectural diagrams but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The generic DeepSpeed-MoE end-to-end framework for training and inference of MoE-based models is open-sourced as part of the DeepSpeed software... Please find the code, tutorials, and documents at DeepSpeed GitHub (https://github.com/microsoft/DeepSpeed) and website (https://www.deepspeed.ai/). |
| Open Datasets | Yes | We pre-trained both the dense and MoE models... using the same training data as described in (Microsoft & Nvidia, 2021). ...we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013). |
| Dataset Splits | Yes | In addition to the pre-training validation loss, we employ 6 zero-shot evaluation tasks to compare the final model quality: LAMBADA (Paperno et al., 2016), PIQA (Bisk et al., 2020), BoolQ (Wang et al., 2019), RACE-h (Lai et al., 2017), TriviaQA (Joshi et al., 2017), WebQs (Berant et al., 2013). ... Figure 1 shows that the validation loss of the MoE models is significantly better than their dense counterparts |
| Hardware Specification | Yes | We pre-trained both the dense and MoE models on 128 NVIDIA Ampere A100 GPUs (Azure ND A100 instances) |
| Software Dependencies | No | The paper mentions software like 'DeepSpeed software' and 'PyTorch implementation' (e.g., 'PyTorch-MoE' and 'DeepSpeed-MoE' in Figure 7), but it does not specify any version numbers for these or other software dependencies required for replication. |
| Experiment Setup | Yes | Appendix B. MoE-based NLG Model Training and Evaluation Settings. Table 3 summarizes the hyperparameters for training the dense and MoE models. (Table 3 details Num. layers, Hidden size, Num. attention heads, Num. experts per layer, Num. parameters, Context/sequence length, Training tokens, Batch size, Learning rate, etc.) |
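
The Software Dependencies row notes that the paper gives no version numbers. As a minimal sketch of how a replication attempt might record the versions it actually ran with, the snippet below queries the local environment using Python's standard `importlib.metadata`. The helper name `installed_versions` and the package list (`deepspeed`, `torch`) are assumptions made for illustration; they are not taken from the paper, and the output reflects only the local installation, not the versions the authors used.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not installed in this environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list: the paper names DeepSpeed and PyTorch but pins no versions.
    for name, version in installed_versions(["deepspeed", "torch"]).items():
        print(f"{name}=={version}" if version else f"{name}: not installed")
```

Saving this output alongside experiment logs gives later readers the dependency record that the paper itself does not provide.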