Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Scaling Laws for Upcycling Mixture-of-Experts Language Models
Authors: Seng Pei Liew, Takuya Kato, Sho Takase
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we identify empirical scaling laws that describe how performance depends on dataset size and model configuration. In particular, we show that, while scaling these factors improves performance, there is a novel interaction term between the dense and upcycled training datasets that limits the efficiency of upcycling at large computational budgets. |
| Researcher Affiliation | Industry | 1SB Intuitions, Tokyo, Japan. Correspondence to: Seng Pei Liew <EMAIL>. |
| Pseudocode | No | The paper describes the routing mechanism and auxiliary loss using mathematical formulas (e.g., Equation 15) within the text, but it does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and data (cross-entropy losses) for the analyses in the paper are available at https://github.com/sbintuitions/sparse-upcycling-scaling-laws. |
| Open Datasets | Yes | We use a training dataset derived from the Common Crawl portion of SlimPajama-DC (Shen et al., 2023), containing 368B tokens in total. The test loss is calculated from the default validation set (0.3B tokens) defined therein. In Appendix B, we train models on two different datasets (Japanese language and source code datasets) to show that the scaling behavior generalizes across datasets. |
| Dataset Splits | Yes | The test loss is calculated from the default validation set (0.3B tokens) defined therein. |
| Hardware Specification | Yes | Our experiments are performed on multiple nodes, each consisting of 8 NVIDIA H100 80 GB GPUs, interconnected via InfiniBand HDR. |
| Software Dependencies | Yes | We use and modify the Megatron-LM (core v0.8.0) library for our experiments. Models are trained with data type bfloat16. Except for the largest MoE we train (8x1B), which has tensor parallelism configured to be 2, all models are trained with data and sequence parallelism only (Korthikanti et al., 2023). Other optimization libraries used include FlashAttention (Dao et al., 2022) and Transformer Engine. |
| Experiment Setup | Yes | The common training setup is shown in Table 4, and the model-dependent setup (warmup iterations, standard deviation of the normal distribution for initializing weights, maximum iteration count, batch size, tuned LR) is shown in Table 5. As described in the main text, we use the WSD schedule for training. The number of warmup steps of the WSD LR schedule is set to be roughly the same as the total model size (Porian et al., 2024). Linear decay to 10% of the maximum LR value is used in the last stage of the schedule, with the length set to be around 10% of the training length, following Hägele et al. (2024). |
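The Warmup-Stable-Decay (WSD) schedule described in the setup row can be sketched as a simple step-to-LR function. This is a minimal illustrative sketch, not the paper's implementation: the function name `wsd_lr` and its parameters are hypothetical, and the specific values (linear warmup, constant stable phase, linear decay to 10% of the peak LR over the last ~10% of steps) follow the description quoted above.

```python
def wsd_lr(step, max_lr, total_steps, warmup_steps,
           decay_frac=0.10, final_frac=0.10):
    """Warmup-Stable-Decay (WSD) learning-rate schedule (illustrative sketch).

    - Linear warmup from 0 to max_lr over warmup_steps.
    - Constant at max_lr during the stable phase.
    - Linear decay to final_frac * max_lr over the last decay_frac of training.
    """
    decay_start = int(total_steps * (1.0 - decay_frac))
    if step < warmup_steps:
        # Linear warmup phase.
        return max_lr * step / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return max_lr
    # Decay phase: linear ramp from max_lr down to final_frac * max_lr.
    progress = (step - decay_start) / (total_steps - decay_start)
    return max_lr * (1.0 - (1.0 - final_frac) * progress)
```

For example, with `total_steps=1000` and `warmup_steps=100`, the LR rises linearly to its peak by step 100, stays flat until step 900, then decays linearly to 10% of the peak by step 1000.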