Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Model Merging in Pre-training of Large Language Models
Authors: Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, Lingjun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Liang Xiang, Yonghui Wu
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present a comprehensive investigation of model merging techniques during the pre-training process. Through extensive experiments with both dense and Mixture-of-Experts (MoE) architectures ranging from millions to over 100 billion parameters, we demonstrate that merging checkpoints trained with constant learning rates not only achieves significant performance improvements but also enables accurate prediction of annealing behavior. These improvements lead to both more efficient model development and significantly lower training costs. Our detailed ablation studies on merging strategies and hyperparameters provide new insights into the underlying mechanisms while uncovering novel applications. |
| Researcher Affiliation | Industry | Yunshui Li, Yiyuan Ma, Shen Yan, Chaoyi Zhang, Jing Liu, Jianqiao Lu, Ziwen Xu, Mengzhao Chen, Minrui Wang, Shiyi Zhan, Jin Ma, Xunhao Lai, Yao Luo, Xingyan Bin, Hongbin Ren, Mingji Han, Wenhao Hao, Bairen Yi, Lingjun Liu, Bole Ma, Xiaoying Jia, Xun Zhou, Liang Xiang, Yonghui Wu. ByteDance Seed |
| Pseudocode | No | The paper describes methods and mathematical formulations for model merging (e.g., SMA, EMA, WMA in Section 3 and Taylor expansion in Appendix C), but it does not present any structured pseudocode or algorithm blocks. The steps are described in narrative text. |
| Open Source Code | No | Through comprehensive experimental analysis, we offer the open-source community practical pre-training guidelines for effective model merging. (This statement indicates providing guidelines *to* the community, not releasing code for their own work.) Additionally, the NeurIPS checklist states for question 5 "Open access to data and code": "[No]" with the justification: "Although the code and data are not provided, it does not affect the main conclusions of the paper." |
| Open Datasets | No | We trained a diverse set of LLMs of varying sizes and architectures from scratch... employing optimal values for training on an internal pretraining corpus comprising trillions of tokens. Although specific model architectures and datasets have not yet been publicly released, we posit that our findings are not strongly tied to these particular choices, as subsequent experiments primarily focus on MoE structures. |
| Dataset Splits | No | The paper mentions an "internal pretraining corpus comprising trillions of tokens" for training and lists various open-source benchmarks for evaluation. However, it does not specify any training/test/validation splits for its primary internal pretraining corpus. It only mentions using public benchmarks for evaluation without detailing how these were used in terms of splits. |
| Hardware Specification | Yes | We primarily used NVIDIA H-series GPUs for training and evaluation. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We employ a Warmup-Stable-Decay (WSD) learning rate scheduler Hu et al. [2024], which begins with a short warmup period, followed by an extended period of stable training at a constant learning rate, and concludes with annealing to a relatively small learning rate. The learning rates are determined according to scaling law guidelines Bi et al. [2024], Kaplan et al. [2020]... the optimal merging interval exhibits a clear scaling relationship with model size... an interval of around 8B tokens for 1.3B/13B models, 4B tokens for 0.7B/7B models, and approximately 80B tokens for 10B/100B models... We opted for N = 10 in further experiments... We conducted an ablation study to assess the sensitivity of the PMA-init of the CT stage with varying learning rate schedules... Seed-MoE-0.7B/7B models... training a 330M/3.3B MoE model from scratch using an exceptionally high learning rate of 6e-3... We conducted SFT training for 220M tokens using both the original weights and PMA-init weights. For the original weights, we used a cosine learning rate schedule with an initial learning rate of 2e-5 and an end learning rate of 2e-6. For the PMA-init weights, we used cosine schedules with initial learning rates of 1e-5, 2e-5, and 4e-5, all with an end learning rate of 2e-6. |
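
The Experiment Setup row describes two learning rate schedules in words: Warmup-Stable-Decay (WSD) for pre-training and a cosine schedule for SFT. The sketch below is one possible reading of those descriptions, not the authors' implementation. The function names, the linear shape of the warmup and decay phases, and all step counts are assumptions introduced here for illustration; only the cosine endpoints (2e-5 and 2e-6) come from the quoted text.

```python
import math


def wsd_lr(step, total_steps, peak_lr, final_lr, warmup_steps, decay_steps):
    """Learning rate at `step` under a warmup-stable-decay schedule.

    Assumption: warmup and decay are both linear. The quoted text only says
    the schedule warms up, holds a constant rate, then anneals to a small rate.
    """
    if step < warmup_steps:
        # Short warmup: ramp linearly up to the peak rate.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Stable phase: constant learning rate.
        return peak_lr
    # Annealing phase: move linearly from peak_lr to final_lr.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (final_lr - peak_lr) * progress


def cosine_lr(step, total_steps, initial_lr, end_lr):
    """Cosine decay from initial_lr at step 0 to end_lr at total_steps."""
    progress = min(1.0, step / total_steps)
    return end_lr + 0.5 * (initial_lr - end_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Hypothetical step counts and WSD rates; SFT endpoints are from the table.
    for s in (0, 99, 500, 900, 1000):
        print("wsd", s, wsd_lr(s, 1000, 3e-4, 3e-5, 100, 200))
    for s in (0, 500, 1000):
        print("cosine", s, cosine_lr(s, 1000, 2e-5, 2e-6))
```

The stable phase is the part of WSD that matters for this paper: the checkpoints being merged are, per the quoted abstract, those trained with a constant learning rate.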
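
The Pseudocode row notes that the paper describes SMA, EMA, and WMA merging only in narrative text. As a rough illustration of what merging the last N checkpoints can look like, here is a minimal sketch. It is not the paper's algorithm: the weighting formulas are the common textbook definitions of simple, linearly weighted, and exponential moving averages, assumed here because this page does not reproduce the paper's formulas. Checkpoints are modeled as plain dictionaries of float lists, and every function name is hypothetical.

```python
def merge_checkpoints(checkpoints, weights):
    """Weighted average of checkpoints, ordered oldest first.

    Each checkpoint maps a parameter name to a flat list of floats.
    Weights are normalized by their sum, so any positive scale works.
    """
    total = sum(weights)
    if len(checkpoints) != len(weights) or total <= 0:
        raise ValueError("need one positive-sum weight per checkpoint")
    merged = {}
    for name, first in checkpoints[0].items():
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints)) / total
            for i in range(len(first))
        ]
    return merged


def sma_weights(n):
    """Simple moving average: every checkpoint counts equally."""
    return [1.0] * n


def wma_weights(n):
    """Assumed linear weighting: later checkpoints count more."""
    return [float(i) for i in range(1, n + 1)]


def ema_weights(n, alpha):
    """Assumed exponential weighting truncated to the last n checkpoints."""
    return [alpha * (1.0 - alpha) ** (n - 1 - i) for i in range(n)]


if __name__ == "__main__":
    # Two toy checkpoints; the table reports N = 10 in the paper's experiments.
    ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}]
    print(merge_checkpoints(ckpts, sma_weights(2)))
    print(merge_checkpoints(ckpts, wma_weights(2)))
    print(merge_checkpoints(ckpts, ema_weights(2, 0.5)))
```

In a real setting the checkpoints would be saved at a fixed token interval (the table quotes roughly 4B to 80B tokens depending on model size) and the parameters would be tensors rather than Python lists.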