Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

DLoFT: Gradient-Decoupled Fine-Tuning for Generalizable Long Chain-of-Thought Reasoning

Authors: Sitong Wu, Haoru Tan, Jingyao Li, Shaofeng Zhang, Xiaojuan Qi, Bei Yu, Jiaya Jia

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments demonstrate that our DLoFT significantly improves the generalization behavior of Long CoT abilities compared to SFT while maintaining strong in-distribution performance. The code is available at https://github.com/dvlab-research/DLoFT.
Researcher Affiliation	Academia	(1) The Chinese University of Hong Kong; (2) The University of Hong Kong; (3) Shanghai Jiao Tong University; (4) The Hong Kong University of Science and Technology
Pseudocode	Yes	The overall algorithm is outlined in Algorithm 1. Algorithm 1: Decoupled Long CoT Fine-Tuning (DLoFT). Input: training dataset $D$, initial model parameters $\theta$, learning rate $\eta$, maximum iterations $T$, batch size $B$. For $t = 1$ to $T$ do: sample a mini-batch $\{(P_i, T_i, S_i)\}_{i=1}^{B}$ from $D$, where $P_i$ is the $i$-th problem, $T_i$ is the exploratory thinking process in the Long CoT response for $P_i$, and $S_i$ is the deterministic solution in the Long CoT response for $P_i$. Step 1: Compute the full response ($T_i \oplus S_i$) gradient, where $\oplus$ denotes concatenation: $L_{\text{full}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{\lvert T_i \oplus S_i \rvert} \log p_\theta\big((T_i \oplus S_i)_k \mid (T_i \oplus S_i)_{<k}, P_i\big)$ (NLL loss); $g_{\text{full}} = \nabla_\theta L_{\text{full}}$ (compute gradient). Step 2: Gradient decoupling. Compute the reference gradient (w.r.t. problem-specific information): $L_{\text{ref}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{\lvert S_i \rvert} \log p_\theta\big(S_i^{k} \mid S_i^{<k}, P_i\big)$ (NLL loss with $S_i$ as target); $g_{\text{ref}} = \nabla_\theta L_{\text{ref}}$ (compute gradient). Decouple the content-relevant gradient (w.r.t. problem-specific information) from $g_{\text{full}}$: $g_{\text{con}} = \frac{\langle g_{\text{full}}, g_{\text{ref}} \rangle}{\lVert g_{\text{ref}} \rVert^{2}} \, g_{\text{ref}}$ (projection). Decouple the paradigm-relevant gradient (w.r.t. Long CoT reasoning paradigm) from $g_{\text{full}}$: $g_{\text{par}} = g_{\text{full}} - g_{\text{con}}$ (orthogonalization). Step 3: Update model parameters: $\theta \leftarrow \theta - \eta \, g_{\text{par}}$ (use the decoupled paradigm-relevant gradient to update the model). Return $\theta$.
Open Source Code	Yes	The code is available at https://github.com/dvlab-research/DLoFT.
Open Datasets	Yes	Our experiments utilize two types of Long CoT datasets that differ in coverage: Mixed-Domains Dataset. We adopt the well-known s1K [7] dataset... Single-Domain Dataset. We use open-source datasets that focus on three major domains: math, code, and medicine. Specifically, we take the OpenR1-Math-220K dataset [12] for mathematics, the programming-related subset from OpenThoughts-114K dataset [37] for code, and the Medical-o1 dataset [38] for medicine.
Dataset Splits	No	Our experiments utilize two types of Long CoT datasets that differ in coverage: Mixed-Domains Dataset. We adopt the well-known s1K [7] dataset... Single-Domain Dataset. We use open-source datasets that focus on three major domains: math, code, and medicine. Specifically, we take the OpenR1-Math-220K dataset [12] for mathematics, the programming-related subset from OpenThoughts-114K dataset [37] for code, and the Medical-o1 dataset [38] for medicine. Prior studies [39, 7] have revealed that learning Long CoT reasoning capabilities does not require hundreds of thousands of data, because it focuses on changing the reasoning paradigm rather than memorizing a large amount of knowledge. Therefore, we randomly sample 5K data from each of the above three datasets for our training, denoted as OpenR1-Math-5K, OpenThoughts-Code-5K, and Medical-o1-5K, respectively.
Hardware Specification	No	The experiments are conducted on Qwen2.5-7B-Instruct [11] model with s1K [7] dataset. We report the average running time of each training step and the average GPU memory cost throughout the training.
Software Dependencies	No	The models are trained for 10 epochs using the AdamW optimizer with weight decay of zero. The learning rate is first increased from 0 to 1e-5 with a warm-up ratio of 0.03, and then decreased following a cosine decay schedule. We set the batch size as 16 when training on s1K [7] dataset, and 64 for other training datasets. During the RL training stage, we use the recent popular GRPO algorithm [40] with a KL penalty coefficient of 0.001. The batch size is set to 128, and the learning rate is set to 1e-6 keeping constant throughout the training.
Experiment Setup	Yes	The models are trained for 10 epochs using the AdamW optimizer with weight decay of zero. The learning rate is first increased from 0 to 1e-5 with a warm-up ratio of 0.03, and then decreased following a cosine decay schedule. We set the batch size as 16 when training on s1K [7] dataset, and 64 for other training datasets. During the RL training stage, we use the recent popular GRPO algorithm [40] with a KL penalty coefficient of 0.001. The batch size is set to 128, and the learning rate is set to 1e-6 keeping constant throughout the training. We train the model for 3 epochs, because it is found that the reward curve converges after about 3 epochs.
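
Illustrative sketch (not the authors' code): the gradient decoupling quoted in the Pseudocode row projects the full-response gradient onto the reference gradient and updates the parameters with the orthogonal remainder. The snippet below shows that arithmetic on flattened gradient vectors stored as plain Python lists. The function names `decouple_gradient` and `dloft_step`, the flattened-list representation, and the `eps` guard against a zero reference gradient are assumptions made for illustration and do not come from the paper or its repository.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def decouple_gradient(
    g_full: Sequence[float],
    g_ref: Sequence[float],
    eps: float = 1e-12,  # assumption: guard for a (near-)zero reference gradient
) -> Tuple[List[float], List[float]]:
    """Split g_full into a component along g_ref (g_con) and the remainder (g_par)."""
    ref_sq_norm = dot(g_ref, g_ref)
    if ref_sq_norm <= eps:
        # Nothing to project onto: treat the whole gradient as paradigm-relevant.
        return [0.0] * len(g_full), list(g_full)
    coeff = dot(g_full, g_ref) / ref_sq_norm
    g_con = [coeff * r for r in g_ref]             # projection onto g_ref
    g_par = [f - c for f, c in zip(g_full, g_con)]  # orthogonal remainder
    return g_con, g_par


def dloft_step(
    theta: Sequence[float],
    g_full: Sequence[float],
    g_ref: Sequence[float],
    lr: float,
) -> List[float]:
    """One plain gradient-descent update using only the paradigm-relevant gradient."""
    _, g_par = decouple_gradient(g_full, g_ref)
    return [p - lr * g for p, g in zip(theta, g_par)]


if __name__ == "__main__":
    g_con, g_par = decouple_gradient([3.0, 4.0], [1.0, 0.0])
    print(g_con, g_par)
    print(dloft_step([1.0, 1.0], [3.0, 4.0], [1.0, 0.0], lr=0.5))
```

By construction, `g_par` has zero inner product with `g_ref`, which is the orthogonality the algorithm relies on. In a real training loop the two gradients would come from two backward passes over the model parameters, and the update would typically go through an optimizer rather than the plain step shown here.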
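
Illustrative sketch (not the authors' code): the Experiment Setup row describes a learning rate that rises from 0 to 1e-5 over a warm-up ratio of 0.03 and then follows a cosine decay. One common way to express such a schedule is shown below. The function name `lr_at_step`, the linear shape of the warm-up, the decay floor of 0 (`min_lr`), and the per-step granularity are assumptions; the quoted text does not specify them.

```python
import math


def lr_at_step(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-5,       # peak learning rate quoted in the table
    warmup_ratio: float = 0.03,  # warm-up ratio quoted in the table
    min_lr: float = 0.0,         # assumption: cosine decay ends at zero
) -> float:
    """Linear warm-up from 0 to peak_lr, then cosine decay towards min_lr."""
    warmup_steps = round(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical number of optimizer steps
    for s in (0, 15, 30, 515, 1000):
        print(s, lr_at_step(s, total))
```

With a hypothetical 1000 optimizer steps, the warm-up covers the first 30 steps, the rate peaks at step 30, and it reaches the floor at the final step.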