Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Mind the Gap: Bridging Thought Leap for Improved Chain-of-Thought Tuning

Authors: Haolei Xu, Yuchen Yan, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Shengpei Jiang, Kaitao Song, Weiming Lu, Jun Xiao, Yueting Zhuang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Through comprehensive experiments on mathematical reasoning benchmarks, we demonstrate that models fine-tuned on bridged datasets consistently outperform those trained on original datasets, with improvements of up to +5.87% on NuminaMath. Our approach effectively enhances distilled data (+3.02%) and provides better starting points for reinforcement learning (+3.1%), functioning as a plug-and-play module compatible with existing optimization techniques.
Researcher Affiliation	Collaboration	Haolei Xu1 Yuchen Yan1 Yongliang Shen1 Wenqi Zhang1 Guiyang Hou1 Shengpei Jiang2 Kaitao Song3 Weiming Lu1 Jun Xiao1 Yueting Zhuang1 1 Zhejiang University 2 SF Technology 3 Microsoft Research Asia
Pseudocode	No	The paper describes a task formalization and a data augmentation process, but it does not include a clearly labeled pseudocode or algorithm block detailing the method's steps in a structured format like code.
Open Source Code	No	Project: https://zju-real.github.io/CoT-Bridge. In the NeurIPS Paper Checklist, the authors state: 'We will release the code once accepted.'
Open Datasets	Yes	We constructed a specialized training dataset called ScaleQM+, based on the structured ScaleQuestMath dataset [19]... We apply CoT-Bridge to enhance existing mathematical reasoning datasets, specifically MetaMathQA and NuminaMath-CoT... NuminaMath-CoT [14] compiles data from examinations, competitions, and Q&A communities, resulting in a dataset of 860k problem-solution pairs. Hugging Face repository, 13:9, 2024.
Dataset Splits	Yes	This process yields 588k training samples with 10k examples held out for testing. We employed six benchmarks: GSM8K [15], MATH500 [22], and GaoKao2023EN [23] as basic-level benchmarks, and MathOdyssey [24], OlympiadBenchEN [25], and AMC23 [26] as advanced competition-level benchmarks.
Hardware Specification	Yes	All SFT experiments were conducted using 8 Ascend H910B-64G. All model evaluations were performed using 4 NVIDIA A100-40G GPUs.
Software Dependencies	No	The paper mentions using LlamaFactory, vLLM, Math-Verify, Open R1, the veRL training framework, and the GRPO algorithm, but does not provide specific version numbers for these software components.
Experiment Setup	Yes	The initial learning rate was set to 1 × 10⁻⁵ with a warm-up ratio of 0.1, and cosine scheduling was used to gradually reduce the learning rate to zero. The maximum sequence length was set to 8192 tokens, with a global batch size of 128. We trained models for 3 epochs on MetaMathQA [12] and for 2 epochs on NuminaMath-CoT [14]... All RL experiments were conducted using the veRL [62] training framework. We employed Math-Verify for answer verification. We set the initial learning rate to 1 × 10⁻⁶ and used a global batch size of 512. The maximum response length was limited to 4096 tokens. During rollout, 4 samples were generated per input. KL regularization was disabled, and evaluation was performed with zero temperature every 10 epochs on MATH500.
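
The SFT learning-rate schedule quoted in the Experiment Setup row (a peak of 1e-5, a warm-up ratio of 0.1, then cosine decay to zero) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the function name `lr_at_step`, the linear shape of the warm-up, and the total step count in the usage lines are all hypothetical, since the quoted text gives only the peak rate, the warm-up ratio, and the cosine decay target.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    """Hypothetical helper: warm-up followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Assumed linear ramp from 0 up to peak_lr during warm-up.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr (end of warm-up) down to 0 (final step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Usage with an illustrative (assumed) run length of 1000 optimizer steps.
total = 1000
schedule = [lr_at_step(s, total) for s in range(total + 1)]
```

The same function would cover the RL setting quoted in the row by passing `peak_lr=1e-6`, although the quoted text does not say whether warm-up or cosine decay was used there.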