Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

Authors: Wanlong Liu, Junxiao Xu, Fei Richard Yu, Yukang Lin, Ke Ji, Wenyu Chen, Lifeng Shang, Yasheng Wang, Yan Xu, Benyou Wang

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
Researcher Affiliation	Collaboration	1 University of Electronic Science and Technology of China, Chengdu, China 2 The Chinese University of Hong Kong, Shenzhen 3 Huawei Noah's Ark Lab
Pseudocode	No	No explicit section or figure titled "Pseudocode" or "Algorithm" is present. The methodology is described in prose and mathematical formulas, not structured algorithm blocks.
Open Source Code	Yes	QFFT is publicly available at https://github.com/LWL-cpu/Question-Free-Fine-Tuning.
Open Datasets	Yes	We select several high-quality distillation datasets, including S1.1 (1k) [9], LIMO (871) [10], and Bespoke-Stratos (17k) [8]... We evaluate model performance on six in-domain math datasets, including two simple datasets: GSM8K [28] and MATH500 [22], two medium-difficulty datasets: the American Mathematics Competitions (AMC23) and Minerva [29], which includes undergraduate-level STEM problems, and two high-difficulty datasets: AIME 2024 and AIME 2025. Additionally, we also evaluate on two out-of-domain non-math datasets (in Section B.2): GPQA [30] and MMLU-Pro [31]. Additionally, to investigate whether the removal of queries by QFFT increases model hallucinations, we further evaluate the models on LLM-AggreFact [35], a benchmark specifically designed for hallucination detection. We utilized the ProcessBench [49] dataset, which contains detailed step-by-step solutions to mathematical problems produced by Short CoT models, with each step annotated as either correct or incorrect.
Dataset Splits	Yes	We select several high-quality distillation datasets, including S1.1 (1k) [9], LIMO (871) [10], and Bespoke-Stratos (17k) [8]... We evaluate model performance on six in-domain math datasets, including two simple datasets: GSM8K [28] and MATH500 [22], two medium-difficulty datasets: the American Mathematics Competitions (AMC23) and Minerva [29], which includes undergraduate-level STEM problems, and two high-difficulty datasets: AIME 2024 and AIME 2025. Additionally, we also evaluate on two out-of-domain non-math datasets (in Section B.2): GPQA [30] and MMLU-Pro [31]. For the Difficulty-Adaptive Distillation baseline, specific subsets are defined: "we identify the challenging subset D_hard (850 examples) by selecting questions from S1.1 that Qwen2.5-Instruct-7B fails to solve. For the simple subset D_easy, we combine questions correctly answered by Qwen2.5-Instruct-7B from S1.1 and randomly sampled questions from the GSM8K training set, totaling 850 examples."
Hardware Specification	No	The paper states: 'All experiments are conducted with LLaMA Factory [27], with a maximum sequence length of 16,384 tokens.' and 'We use the VLLM reasoning architecture, and the inference setup is aligned with LIMO.' It also mentions using 'Qwen2.5-Instruct-7B and Qwen2.5-Instruct-32B' and 'Phi4-mini-Instruct' as base models. However, it does not specify concrete hardware details such as GPU models, CPU types, or memory.
Software Dependencies	No	The paper states: 'All experiments are conducted with LLaMA Factory [27]'. LLaMA Factory is named as the training framework, but the text gives no version number for it or for any other library or framework (such as PyTorch, Transformers, or CUDA), and version numbers are required for this variable.
Experiment Setup	Yes	Appendix E.2, Table 5 provides detailed hyperparameters used for training different models: Cutoff_len 16384, Batch_size 8-32, Learning_rate 1e-5, Epochs 6, Lr_scheduler_type Cosine, Weight_decay 1e-4, Warmup_ratio 0.1.
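
For readers who want to carry the reported setup into their own scripts, the following is a minimal sketch that records the Table 5 values quoted above and derives step counts from them. The class name TrainingSetup and the helper schedule_steps are hypothetical and come from neither the paper nor LLaMA Factory. The sketch also assumes that the reported batch size is the global batch size and that warmup steps are rounded up, neither of which the text above confirms.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingSetup:
    """Hyperparameters as quoted from Appendix E.2, Table 5 (hypothetical container)."""
    cutoff_len: int = 16384
    batch_size: int = 8            # reported as a range, 8-32; 8 is chosen here only for illustration
    learning_rate: float = 1e-5
    epochs: int = 6
    lr_scheduler_type: str = "cosine"
    weight_decay: float = 1e-4
    warmup_ratio: float = 0.1


def schedule_steps(setup: TrainingSetup, num_examples: int) -> tuple[int, int]:
    """Return (total optimizer steps, warmup steps) for a dataset of num_examples.

    Assumption: batch_size is the global batch size (no separate gradient
    accumulation factor) and warmup steps are rounded up.
    """
    steps_per_epoch = math.ceil(num_examples / setup.batch_size)
    total_steps = steps_per_epoch * setup.epochs
    warmup_steps = math.ceil(total_steps * setup.warmup_ratio)
    return total_steps, warmup_steps


if __name__ == "__main__":
    # Example: a 1,000-example dataset, roughly the size of S1.1 (1k).
    total, warmup = schedule_steps(TrainingSetup(), num_examples=1000)
    print(f"total steps: {total}, warmup steps: {warmup}")
```

Because the hardware and exact software versions are not reported, the derived step counts are only indicative; a different device count or accumulation setting would change them.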