Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought

Authors: Guanghao Li, Wenhao Jiang, Mingfeng Chen, Yan Li, Hao Yu, Shuting Dong, Tao Ren, Ming Tang, Chun Yuan

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.
Researcher Affiliation	Academia	(1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Southern University of Science and Technology; (3) Guangdong Laboratory of AI and Digital Economy (SZ); (4) The Hong Kong University of Science and Technology; (5) Guanghua School of Management, Peking University
Pseudocode	No	The paper describes the methodology in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. The architectural components and training steps are explained textually.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: All datasets used in this work are publicly available. Our model training follows prior works, and the detailed implementation and training settings are provided in Section 4 and the Appendix.
Open Datasets	Yes	The mixed dataset includes data from: (1) Alpaca GPT4 [45] and Alpaca CoT [46] (general instruction-following and chain-of-thought reasoning), (2) WikiQA [47] (open-domain question answering), (3) Code Alpaca [48] (code generation), and (4) MathInstruct [14] (multi-step mathematical reasoning). For distillation, we apply the Adaptive Kullback-Leibler (AKL) [49] method to align student outputs with teacher distributions. To assess the performance of the fine-tuned model, we evaluate it using the lm-evaluation-harness framework [50] across a broad set of benchmarks, categorized into four areas: (i) commonsense QA, including ARC-easy, ARC-challenge [51], OpenBookQA [52], and TruthfulQA [53]; (ii) multi-step reasoning, such as GSM8K [54] and MMLU [55]; (iii) reading comprehension and dialogue, with CoQA [56] and GLUE [57]; and (iv) code generation, using MBPP [58]. Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: All datasets used in this work are publicly available. Our model training follows prior works, and the detailed implementation and training settings are provided in Section 4 and the Appendix.
Dataset Splits	Yes	Table 3 summarizes the five instruction-tuning corpora used for SCOUT. We follow the official splits and concatenate them after normalizing to a conversational format compatible with Qwen2.5.
Hardware Specification	Yes	Hardware. All experiments are conducted on a single NVIDIA H20 NVLink GPU (96 GB) attached to a dual socket server with 20 CPU cores (Intel Xeon Platinum 8457C) and 200 GB RAM.
Software Dependencies	No	We fine-tune the model for 2 epochs with a learning rate of $2 \times 10^{-5}$ using the Llama Factory framework [44]. We utilize torch native gradient accumulation to emulate a global batch size of 128 sequences. The paper mentions 'Llama Factory framework' and 'torch native gradient accumulation' but does not specify version numbers for these software components or PyTorch itself.
Experiment Setup	Yes	Training configuration. We fine-tune the model for 2 epochs with a learning rate of $2 \times 10^{-5}$ using the Llama Factory framework [44]. ... For distillation, we apply the Adaptive Kullback-Leibler (AKL) [49] method to align student outputs with teacher distributions. Unless otherwise specified, we set $\lambda_t = 1/3$ for all $t$, assigning equal weight to each reasoning iteration. Optimization hyper-parameters. Below we detail the learning-rate schedule and other relevant knobs. Fine-tuning lasts for 2 epochs with a learning rate of $\mathrm{lr}_{\mathrm{pre}} = 2 \times 10^{-5}$ applied to all pretrained parameters. Newly introduced parameters (e.g., cross-attention projection, FC adapters, layer-norm gates, etc.) are trained with a higher learning rate $\mathrm{lr}_{\mathrm{new}} = 2\,\mathrm{lr}_{\mathrm{pre}}$ to accelerate adaptation while minimizing residual drift. We employ a cosine learning rate schedule with a warm-up ratio of 10%. Training is performed using bf16 precision. Distillation loss. Unless stated otherwise, the per-iteration objective is defined as $\mathcal{L}^{(t)} = \mathrm{KL}\left(q^{(t)} \,\|\, p_\theta^{(t)}\right) + \alpha\, \mathcal{L}^{(t)}_{\mathrm{hard}}$ (8), where we use $\alpha = 0.5$.
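
The hyper-parameters quoted in the Experiment Setup row can be read as the following minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the warm-up is assumed to be linear, the cosine decay is assumed to end at zero, a plain KL divergence stands in for the AKL method, q is taken to be the teacher distribution and p the student's, and the per-iteration losses are assumed to be combined as a weighted sum with weights λ_t. All function names are hypothetical.

```python
import math

# Values quoted in the Experiment Setup row.
LR_PRE = 2e-5            # learning rate for pretrained parameters
LR_NEW = 2 * LR_PRE      # learning rate for newly introduced parameters
WARMUP_RATIO = 0.10      # warm-up ratio of 10%
ALPHA = 0.5              # weight on the hard-label loss in Eq. (8)


def lr_at(step, total_steps, base_lr, warmup_ratio=WARMUP_RATIO):
    """Learning rate at a given step.

    Assumption: linear warm-up, then cosine decay to zero. The source only
    says "cosine learning rate schedule with a warm-up ratio of 10%".
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def kl_divergence(q, p, eps=1e-12):
    """Plain KL(q || p) over discrete distributions.

    Assumption: this is a stand-in for the AKL method cited in the paper,
    which is not reproduced here.
    """
    return sum(
        qi * math.log((qi + eps) / (pi + eps))
        for qi, pi in zip(q, p)
        if qi > 0
    )


def iteration_loss(q, p, hard_loss, alpha=ALPHA):
    """Per-iteration objective in the shape of Eq. (8): KL term + alpha * hard loss."""
    return kl_divergence(q, p) + alpha * hard_loss


def total_loss(per_iteration_losses):
    """Equal-weight combination (lambda_t = 1/T) of per-iteration losses.

    Assumption: the lambda_t weights enter as a weighted sum over iterations.
    """
    lam = 1.0 / len(per_iteration_losses)
    return sum(lam * loss for loss in per_iteration_losses)


if __name__ == "__main__":
    total_steps = 100
    for step in (0, 9, 10, 55, 100):
        print(step, lr_at(step, total_steps, LR_PRE), lr_at(step, total_steps, LR_NEW))
    teacher = [0.7, 0.2, 0.1]
    student = [0.6, 0.3, 0.1]
    losses = [iteration_loss(teacher, student, hard_loss=1.2) for _ in range(3)]
    print(total_loss(losses))
```

With three reasoning iterations, `total_loss` reproduces the quoted setting of λ_t = 1/3. In a real training run the two learning rates would typically be applied as separate optimizer parameter groups, which this sketch does not model.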