Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SPRINT: Enabling Interleaved Planning and Parallelized Execution in Reasoning Models

Authors: Emil Biju, Shayan Talaei, Zhemin Huang, Mohammadreza Pourreza, Azalia Mirhoseini, Amin Saberi

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through extensive evaluations, we demonstrate that models fine-tuned with the SPRINT framework match the performance of reasoning models on complex domains such as mathematics while generating up to 39% fewer sequential tokens on problems requiring more than 8,000 output tokens. Finally, we observe consistent results transferred to two out-of-distribution tasks, namely GPQA and Countdown, with up to 45% and 65% reduction in average sequential tokens respectively for longer reasoning trajectories, while matching the performance of the fine-tuned reasoning model.
Researcher Affiliation	Collaboration	¹Stanford University, ²Microsoft, ³Google
Pseudocode	No	The paper describes methodologies and pipelines using descriptive text and flowcharts (e.g., Figure 2 for the training pipeline) rather than formal algorithmic notation, and does not contain explicit pseudocode or algorithm blocks.
Open Source Code	Yes	We open-source our code and datasets at this repository.
Open Datasets	Yes	To train our models, we begin with 6,000 reasoning trajectories from DeepSeek-R1 [3] generated on the training set of the MATH dataset [30], as released by [31]. After filtering these trajectories for correctness of the final answers and processing them through our data curation pipeline (Section 3.2), we obtain a curated set of approximately 1,700 samples for training. For evaluation, we primarily use the MATH-500 benchmark [32], a widely recognized test set consisting of 500 mathematical reasoning problems. To further examine the generalization capabilities of SPRINT to more challenging and out-of-distribution scenarios, we evaluate its performance against strong baseline models on two additional benchmarks. First, we evaluate on GPQA-diamond [10], a dataset from entirely different scientific domains, including biology, physics, and chemistry, thus assessing cross-domain reasoning robustness. Moreover, following [29, 8], we test SPRINT on a subset of 1000 samples from Countdown [8], a synthetic numerical reasoning task in which models must derive a target number from four provided numbers using arithmetic operations (+, −, ×, ÷).
Dataset Splits	Yes	After filtering these trajectories for correctness of the final answers and processing them through our data curation pipeline (Section 3.2), we obtain a curated set of approximately 1,700 samples for training. For evaluation, we primarily use the MATH-500 benchmark [32], a widely recognized test set consisting of 500 mathematical reasoning problems. To further examine the generalization capabilities of SPRINT to more challenging and out-of-distribution scenarios, we evaluate its performance against strong baseline models on two additional benchmarks. First, we evaluate on GPQA-diamond [10], a dataset from entirely different scientific domains, including biology, physics, and chemistry, thus assessing cross-domain reasoning robustness. Moreover, following [29, 8], we test SPRINT on a subset of 1000 samples from Countdown [8], a synthetic numerical reasoning task in which models must derive a target number from four provided numbers using arithmetic operations (+, −, ×, ÷).
Hardware Specification	Yes	Fine-tuning was primarily executed on a single machine with eight NVIDIA A100 GPUs with 40 GB memory per GPU. We use the ms-swift framework [41], a fine-tuning toolkit provided by the Modelscope community. Each model is fine-tuned for 5 epochs. Due to the long-context required for reasoning traces and the memory constraints, we use a batch size of 1 during the training. We use bfloat16 precision, an initial learning rate of 1 × 10⁻⁵, and a weight decay factor of 1 × 10⁻⁴. The learning rate scheduling consists of a linear warm-up phase during the first 5% of training steps, subsequently followed by linear decay to zero over the remaining training iterations. Model evaluation is conducted every 100 steps, and the best-performing model based on evaluation loss is retained. To optimize memory usage during training, we integrate several efficiency strategies, notably the DeepSpeed ZeRO Redundancy Optimizer [42, 43] and 4-bit quantization. DeepSpeed's ZeRO optimizer offers a set of memory-partitioning strategies that trade off memory savings against communication overhead. In many workloads, ZeRO Stage 1 or 2 strikes the best balance between memory efficiency and communication cost; however, since we need to train on long sequences, our per-GPU memory demands exceed what those stages can support. Therefore, we adopted ZeRO Stage 3 to train with extended context lengths without OOM errors. For model evaluation, we leverage vLLM [44] to serve our models. Specifically, each 7B-scale model (SPRINT, RFT, and DeepSeek-R1-Distill-7B) is deployed on a single NVIDIA A100 GPU with 40 GB of memory.
Software Dependencies	No	The paper mentions several software components like "ms-swift framework [41]", "DeepSpeed ZeRO Redundancy Optimizer [42, 43]", "vLLM [44]", "Math-Verify library alongside SymPy". However, it does not provide specific version numbers for these software dependencies, which are required for a 'Yes' answer.
Experiment Setup	Yes	Each model is fine-tuned for 5 epochs. Due to the long-context required for reasoning traces and the memory constraints, we use a batch size of 1 during the training. We use bfloat16 precision, an initial learning rate of 1 × 10⁻⁵, and a weight decay factor of 1 × 10⁻⁴. The learning rate scheduling consists of a linear warm-up phase during the first 5% of training steps, subsequently followed by linear decay to zero over the remaining training iterations. Model evaluation is conducted every 100 steps, and the best-performing model based on evaluation loss is retained.
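
The learning-rate schedule quoted in the Experiment Setup row (linear warm-up over the first 5% of training steps, then linear decay to zero) can be sketched as a plain function. This is a minimal illustration, not the authors' or ms-swift's implementation. The function name, the 0-indexed step convention, the warm-up starting from zero, and the example step count are all assumptions; only the peak rate of 1 × 10⁻⁵ and the 5% warm-up fraction come from the quoted text.

```python
def linear_warmup_decay_lr(step: int,
                           total_steps: int,
                           peak_lr: float = 1e-5,
                           warmup_frac: float = 0.05) -> float:
    """Learning rate at a 0-indexed optimizer step (hypothetical helper).

    Assumes the warm-up starts at zero, which the quoted setup does not state.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if step < 0:
        raise ValueError("step must be non-negative")

    # Number of warm-up steps: first 5% of training by default (at least 1).
    warmup_steps = max(1, int(total_steps * warmup_frac))

    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr.
        return peak_lr * step / warmup_steps

    # Linear decay from peak_lr at the end of warm-up to 0 at total_steps.
    decay_steps = max(1, total_steps - warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / decay_steps


if __name__ == "__main__":
    total = 1000  # hypothetical step count; the source does not report one
    for s in (0, 25, 50, 525, 1000):
        print(s, linear_warmup_decay_lr(s, total))
```

With the hypothetical 1,000 total steps, warm-up covers steps 0 to 49, the rate peaks at step 50, and it reaches zero at step 1,000. The actual step count depends on the dataset size, the number of epochs, and the effective batch size across GPUs, which the quoted text does not fully specify.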