Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

EvoLM: In Search of Lost Training Dynamics for Language Model Reasoning

Authors: Zhenting Qi, Fan Nie, Alexandre Alahi, James Y Zou, Himabindu Lakkaraju, Yilun Du, Eric P Xing, Sham Kakade, Hanlin Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We train over 100 LMs with 1B and 4B parameters from scratch, and evaluate both upstream (language modeling) and downstream (problem-solving) capabilities, including considerations of both in-domain and out-of-domain generalization. Systematic analyses of language model capabilities across their entire lifecycle from pre-training to RL post-training with evaluation on reasoning-intensive upstream cloze tasks and downstream generative tasks, considering both in-domain and out-of-domain generalization.
Researcher Affiliation	Academia	Zhenting Qi¹, Fan Nie², Alexandre Alahi³, James Zou², Himabindu Lakkaraju¹, Yilun Du¹, Eric Xing⁴, Sham Kakade¹, Hanlin Zhang¹. ¹Harvard, ²Stanford, ³EPFL, ⁴CMU
Pseudocode	No	No pseudocode or algorithm block is explicitly labeled or presented in a structured format in the main body of the paper.
Open Source Code	Yes	To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline. Open-sourcing a comprehensive, transparent, and reproducible training pipeline and evaluation framework, facilitating further research into scaling laws, training dynamics, and evaluating upstream and downstream capabilities of language models.
Open Datasets	Yes	To facilitate open research and reproducibility, we release all pre-trained and post-trained models, training datasets for all stages, and our entire training and evaluation pipeline. Pre-training: Conducted on FineWeb-Edu [38]. Continued Pre-training (CPT): Performed on FineMath [2] with token budgets from 2B to 42B. Supervised Fine-Tuning (SFT): Applied to a dataset of QA pairs augmented from GSM8K [12] and MATH [20], collected from a mixture of MetaMathQA [63], OpenMathInstruct2 [55], and NuminaMath [31].
Dataset Splits	Yes	Supervised Fine-Tuning (SFT): Applied to a dataset of QA pairs augmented from GSM8K [12] and MATH [20], collected from a mixture of MetaMathQA [63], OpenMathInstruct2 [55], and NuminaMath [31]. We filter out low-quality prompts using model correctness consistency [39], discarding samples with zero inter-model consensus. Reinforcement Learning (RL): Conducted using Proximal Policy Optimization (PPO) [47], with a binary verifiable reward. The RL stage uses the same data sources as SFT but ensures no overlap with the SFT dataset. Varying SFT dataset size. ... varying the number of SFT examples from 50K to 400K, holding epochs fixed at one... Varying RL dataset size. ... vary the RL dataset size from 0 to 400K examples. Figure 9: Downstream task performance for {1B, 4B}-160BT-8+42BT-{10K, ..., 90K}ep4-{90K, ..., 10K}ep4. ... The total number of post-training samples is fixed at 100K.
Hardware Specification	Yes	We use the AdamW optimizer and up to 32 NVIDIA H100 80GB HBM3 GPUs for all training stages.
Software Dependencies	No	In this work, we establish an end-to-end development pipeline using open toolkits [1, 71, 49] and open data sources [38, 63, 55, 31] to systematically and transparently investigate language models' reasoning capabilities throughout their lifecycle... We use the vLLM framework [29] for inference.
Experiment Setup	Yes	Hyperparameters for pretraining/continued pretraining, SFT, and RL are shown in Table 6, Table 7, Table 8, respectively. We use the AdamW optimizer and up to 32 NVIDIA H100 80GB HBM3 GPUs for all training stages. For pretraining, continued pretraining, and SFT, we use a standard warmup-cosine-decay strategy for the learning rate schedule. For RL, we apply a warmup-constant learning rate schedule.
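
The Experiment Setup row names two learning rate schedules: warmup-cosine-decay (pretraining, continued pretraining, SFT) and warmup-constant (RL). The sketch below shows one common way to define such schedules. It is an illustration only, not the authors' implementation: the function names, the linear warmup shape, the `min_lr` floor, and every numeric value are assumptions, and the paper's actual hyperparameters (its Tables 6 to 8) are not reproduced here.

```python
import math


def linear_warmup(step, warmup_steps, peak_lr):
    """Linearly ramp the learning rate up to peak_lr (assumed warmup shape)."""
    return peak_lr * (step + 1) / warmup_steps


def warmup_cosine_decay(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return linear_warmup(step, warmup_steps, peak_lr)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


def warmup_constant(step, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then hold peak_lr for the rest of training."""
    if step < warmup_steps:
        return linear_warmup(step, warmup_steps, peak_lr)
    return peak_lr


if __name__ == "__main__":
    # Hypothetical values chosen only to make the shapes visible.
    total, warmup, peak = 100, 10, 1e-3
    for s in (0, 9, 10, 55, 100):
        cos_lr = warmup_cosine_decay(s, total, warmup, peak)
        const_lr = warmup_constant(s, warmup, peak)
        print(f"step {s:3d}  cosine={cos_lr:.6f}  constant={const_lr:.6f}")
```

Both schedules share the same warmup phase and differ only afterwards: the cosine variant anneals toward `min_lr` by the final step, while the constant variant stays at the peak rate.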