Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Dense Backpropagation Improves Training for Sparse Mixture-of-Experts

Authors: Ashwinee Panda, Vatsal Baherwani, Zain Sarwar, Benjamin Thérien, Sambit Sahu, Tom Goldstein, Supriyo Chakraborty

Venue: NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In Table 2 we compare Default MoEs to TopK MoEs. All models are trained for 160B tokens on FineWeb-Edu and have 1.96B params. For both finegrained MoEs (32c4) and standard MoEs (8c1), our Default MoEs outperform TopK MoEs on standard benchmarks. We conduct all evaluations with the lm-eval harness [Gao et al., 2024].
Researcher Affiliation | Collaboration | Ashwinee Panda (1), Vatsal Baherwani (1), Zain Sarwar (2,3), Benjamin Therien (2,4), Sambit Sahu (2), Tom Goldstein (1), Supriyo Chakraborty (2). Affiliations: (1) University of Maryland; (2) Capital One; (3) University of Chicago; (4) Mila Quebec AI Institute.
Pseudocode | No | The paper describes methods and concepts using prose and diagrams (Figure 1), but does not contain a distinct section or figure explicitly labeled 'Pseudocode' or 'Algorithm', nor are there structured code-like blocks detailing a procedure.
Open Source Code | Yes | We have open-sourced our training code. All hyperparameters for the models trained in Table 2 can be found in our config file, and all hyperparameters are the same between the baseline and our Default MoE. All further experimental details can be found in Appendix A.2.
Open Datasets | Yes | Dataset. We train on FineWeb-Edu [Lozhkov et al., 2024] (for Table 2 and Figure 11) and FineWeb [Penedo et al., 2024] (for all other plots and results) with the Llama3 tokenizer [Llama 3 Team, 2024].
Dataset Splits | No | The paper states, 'We train on FineWeb-Edu [Lozhkov et al., 2024] (for Table 2 and Figure 11) and FineWeb [Penedo et al., 2024] (for all other plots and results),' indicating the datasets used for pretraining. It also mentions 'We conduct all evaluations with the lm-eval harness [Gao et al., 2024]' for benchmarking. While performance is measured, the paper does not explicitly detail the specific training, validation, and test splits (e.g., percentages or sample counts) for the FineWeb-Edu or FineWeb datasets themselves, or how validation sets were derived from these for perplexity measurement.
Hardware Specification | No | The NeurIPS checklist states 'The LLM experiments were run on 64 GPUs on the AWS cluster,' but it does not specify the exact model of GPUs (e.g., NVIDIA A100, V100) or the specific AWS instance types used, which are required for a reproducible hardware specification.
Software Dependencies | No | The paper mentions software components and libraries like 'gpt-neox library [Andonian et al., 2023]', 'Megablocks [Gale et al., 2022]', and 'Triton kernels from [Hsu et al., 2024]' in Appendix A.2. However, it does not provide specific version numbers for these software dependencies, nor for other tools or frameworks used.
Experiment Setup | Yes | The paper provides extensive details on the experimental setup in Section 4.1 'Experimental Setup' and Appendix A.2 'Experimental Setup Details'. This includes model architectures (e.g., '1.96 billion total parameters', 'N=8, K=1'), dataset details, hyperparameters ('sequence length of 2048 and a global batch size of 1024'), optimizer ('AdamW optimizer'), training schedule ('standard cosine decay schedule'), and MoE-specific settings ('auxiliary loss to 0.01').
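
The batch settings quoted in the Experiment Setup row can be turned into a rough training-length estimate. The sketch below is illustrative only: it assumes the quoted sequence length (2048) and global batch size (1024) apply to the 160B-token runs mentioned in the Research Type row, which this page does not confirm, and the function names are hypothetical rather than taken from the paper's code.

```python
# Figures quoted in the scorecard above. Pairing them is an assumption:
# the page does not state that the 160B-token runs used these batch settings.
SEQ_LEN = 2048               # tokens per sequence
GLOBAL_BATCH = 1024          # sequences per optimizer step
TOTAL_TOKENS = 160 * 10**9   # "160B tokens"


def tokens_per_step(seq_len: int, global_batch: int) -> int:
    """Tokens consumed by one optimizer step."""
    return seq_len * global_batch


def steps_for_budget(total_tokens: int, seq_len: int, global_batch: int) -> int:
    """Optimizer steps needed to cover the token budget (rounded up)."""
    per_step = tokens_per_step(seq_len, global_batch)
    return -(-total_tokens // per_step)  # ceiling division on integers


if __name__ == "__main__":
    print(tokens_per_step(SEQ_LEN, GLOBAL_BATCH))                   # 2097152
    print(steps_for_budget(TOTAL_TOKENS, SEQ_LEN, GLOBAL_BATCH))    # 76294
```

Under these assumptions each step covers about 2.1M tokens, so a 160B-token budget corresponds to roughly 76K optimizer steps. The actual step count should be read from the authors' released config file.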