Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Laws for Upcycling Mixture-of-Experts Language Models
Authors: Seng Pei Liew, Takuya Kato, Sho Takase
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. In particular, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. |
| Researcher Affiliation | Industry | 1SB Intuitions, Tokyo, Japan. Correspondence to: Seng Pei Liew <EMAIL>. |
| Pseudocode | No | The paper describes the routing mechanism and auxiliary loss using mathematical formulas (e.g., Equation 15) within the text, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and data (cross-entropy losses) for the analyses in the paper are available at https://github.com/sbintuitions/sparse-upcycling-scaling-laws. |
| Open Datasets | Yes | We use a training dataset derived from the Common Crawl portion of SlimPajama-DC (Shen et al., 2023), containing 368B tokens in total. The test loss is calculated from the default validation set (0.3B tokens) defined therein. In Appendix B, we train models on two different datasets (Japanese language and source code datasets) to show that the scaling behavior generalizes across datasets. |
| Dataset Splits | Yes | The test loss is calculated from the default validation set (0.3B tokens) defined therein. |
| Hardware Specification | Yes | Our experiments are performed on multiple nodes, each consisting of 8 NVIDIA H100 80 GB GPUs, interconnected via InfiniBand HDR. |
| Software Dependencies | Yes | We use and modify the Megatron-LM (core v0.8.0) library for our experiments. Models are trained with data type bfloat16. Except for the largest MoE we train (8x1B), which has tensor parallelism configured to be 2, all models are trained with data and sequence parallelism only (Korthikanti et al., 2023). Other optimization libraries used include FlashAttention (Dao et al., 2022) and Transformer Engine. |
| Experiment Setup | Yes | The common training setup is shown in Table 4, and the model-dependent setup (warmup iterations, standard deviation of the normal distribution for initializing weights, maximum iteration count, batch size, tuned LR) is shown in Table 5. As described in the main text, we use the WSD schedule for training. The number of warmup steps of the WSD LR schedule is set to be roughly the same as the total model size (Porian et al., 2024). Linear decay to 10% of the maximum LR value is used in the last stage of the schedule, with the length set to be around 10% of the training length, following Hägele et al. (2024). |
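The Warmup-Stable-Decay (WSD) schedule described in the setup row can be sketched as a simple step-to-LR function. This is a minimal illustrative sketch, not the paper's implementation: the function name `wsd_lr` and its parameters are hypothetical, and the specific values (linear warmup, constant stable phase, linear decay to 10% of the peak LR over the last ~10% of steps) follow the description quoted above.

```python
def wsd_lr(step, max_lr, total_steps, warmup_steps,
           decay_frac=0.10, final_frac=0.10):
    """Warmup-Stable-Decay (WSD) learning-rate schedule (illustrative sketch).

    - Linear warmup from 0 to max_lr over warmup_steps.
    - Constant at max_lr during the stable phase.
    - Linear decay to final_frac * max_lr over the last decay_frac of training.
    """
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # Linear warmup phase.
        return max_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return max_lr
    # Decay phase: linear ramp from max_lr down to final_frac * max_lr.
    progress = (step - decay_start) / (total_steps - decay_start)
    return max_lr * (1.0 - (1.0 - final_frac) * progress)
```

For example, with `total_steps=1000` and `warmup_steps=100`, the LR rises linearly to its peak by step 100, stays flat until step 900, then decays linearly to 10% of the peak by step 1000.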