Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
DLoFT: Gradient-Decoupled Fine-Tuning for Generalizable Long Chain-of-Thought Reasoning
Authors: Sitong Wu, Haoru Tan, Jingyao Li, Shaofeng Zhang, Xiaojuan Qi, Bei Yu, Jiaya Jia
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that our DLoFT significantly improves the generalization behavior of Long CoT abilities compared to SFT while maintaining strong in-distribution performance. The code is available at https://github.com/dvlab-research/DLoFT. |
| Researcher Affiliation | Academia | ¹The Chinese University of Hong Kong, ²The University of Hong Kong, ³Shanghai Jiao Tong University, ⁴The Hong Kong University of Science and Technology |
| Pseudocode | Yes | The overall algorithm is outlined in Algorithm 1. Algorithm 1: Decoupled Long CoT Fine-Tuning (DLoFT). Input: training dataset $D$, initial model parameters $\theta$, learning rate $\eta$, maximum iterations $T$, batch size $B$. For $t = 1$ to $T$ do: sample a mini-batch $\{(P_i, T_i, S_i)\}_{i=1}^{B}$ from $D$, where $P_i$ is the $i$-th problem, $T_i$ is the exploratory thinking process in the Long CoT response for $P_i$, and $S_i$ is the deterministic solution in the Long CoT response for $P_i$. Step 1: Compute the full-response ($T_i \oplus S_i$) gradient, where $\oplus$ denotes concatenation: $L_{\text{full}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{\lvert T_i \oplus S_i \rvert} \log p_\theta\big((T_i \oplus S_i)_k \mid (T_i \oplus S_i)_{<k}, P_i\big)$ (NLL loss); $g_{\text{full}} = \nabla_\theta L_{\text{full}}$ (compute gradient). Step 2: Gradient decoupling. Compute the reference gradient (w.r.t. problem-specific information): $L_{\text{ref}} = -\frac{1}{B} \sum_{i=1}^{B} \sum_{k=1}^{\lvert S_i \rvert} \log p_\theta\big(S_i^{k} \mid S_i^{<k}, P_i\big)$ (NLL loss with $S_i$ as target); $g_{\text{ref}} = \nabla_\theta L_{\text{ref}}$ (compute gradient). Decouple the content-relevant gradient (w.r.t. problem-specific information) from $g_{\text{full}}$: $g_{\text{con}} = \frac{\langle g_{\text{full}}, g_{\text{ref}} \rangle}{\lVert g_{\text{ref}} \rVert^{2}}\, g_{\text{ref}}$ (projection). Decouple the paradigm-relevant gradient (w.r.t. Long CoT reasoning paradigm) from $g_{\text{full}}$: $g_{\text{par}} = g_{\text{full}} - g_{\text{con}}$ (orthogonalization). Step 3: Update model parameters: $\theta \leftarrow \theta - \eta\, g_{\text{par}}$ (use the decoupled paradigm-relevant gradient to update the model). Return $\theta$. |
| Open Source Code | Yes | The code is available at https://github.com/dvlab-research/DLoFT. |
| Open Datasets | Yes | Our experiments utilize two types of Long CoT datasets that differ in coverage: Mixed-Domains Dataset. We adopt the well-known s1K [7] dataset... Single-Domain Dataset. We use open-source datasets that focus on three major domains: math, code, and medicine. Specifically, we take the OpenR1-Math-220K dataset [12] for mathematics, the programming-related subset from OpenThoughts-114K dataset [37] for code, and the Medical-o1 dataset [38] for medicine. |
| Dataset Splits | No | Our experiments utilize two types of Long CoT datasets that differ in coverage: Mixed-Domains Dataset. We adopt the well-known s1K [7] dataset... Single-Domain Dataset. We use open-source datasets that focus on three major domains: math, code, and medicine. Specifically, we take the OpenR1-Math-220K dataset [12] for mathematics, the programming-related subset from OpenThoughts-114K dataset [37] for code, and the Medical-o1 dataset [38] for medicine. Prior studies [39, 7] have revealed that learning Long CoT reasoning capabilities does not require hundreds of thousands of data, because it focuses on changing the reasoning paradigm rather than memorizing a large amount of knowledge. Therefore, we randomly sample 5K data from each of the above three datasets for our training, denoted as OpenR1-Math-5K, OpenThoughts-Code-5K, and Medical-o1-5K, respectively. |
| Hardware Specification | No | The experiments are conducted on Qwen2.5-7B-Instruct [11] model with s1K [7] dataset. We report the average running time of each training step and the average GPU memory cost throughout the training. |
| Software Dependencies | No | The models are trained for 10 epochs using the AdamW optimizer with weight decay of zero. The learning rate is first increased from 0 to 1e-5 with a warm-up ratio of 0.03, and then decreased following a cosine decay schedule. We set the batch size as 16 when training on s1K [7] dataset, and 64 for other training datasets. During the RL training stage, we use the recent popular GRPO algorithm [40] with a KL penalty coefficient of 0.001. The batch size is set to 128, and the learning rate is set to 1e-6 keeping constant throughout the training. |
| Experiment Setup | Yes | The models are trained for 10 epochs using the AdamW optimizer with weight decay of zero. The learning rate is first increased from 0 to 1e-5 with a warm-up ratio of 0.03, and then decreased following a cosine decay schedule. We set the batch size as 16 when training on s1K [7] dataset, and 64 for other training datasets. During the RL training stage, we use the recent popular GRPO algorithm [40] with a KL penalty coefficient of 0.001. The batch size is set to 128, and the learning rate is set to 1e-6 keeping constant throughout the training. We train the model for 3 epochs, because it is found that the reward curve converges after about 3 epochs. |
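
Illustrative sketch (not taken from the paper's repository): the gradient-decoupling and update steps quoted in the Pseudocode row can be written out on plain Python lists. It assumes the gradients of all parameters are flattened into a single vector before projection, which the quoted excerpt does not specify. The names `dot`, `decouple_gradient`, `dloft_step`, and the `eps` guard for a near-zero reference gradient are hypothetical and introduced here only for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def decouple_gradient(
    g_full: Vector, g_ref: Vector, eps: float = 1e-12
) -> Tuple[Vector, Vector]:
    """Split g_full into (g_con, g_par) relative to the reference gradient.

    g_con is the projection of g_full onto g_ref (content-relevant part);
    g_par = g_full - g_con is orthogonal to g_ref (paradigm-relevant part).
    """
    ref_sq_norm = dot(g_ref, g_ref)
    if ref_sq_norm < eps:
        # Hypothetical guard (not in the quoted algorithm): with no usable
        # reference direction, nothing is removed from g_full.
        return [0.0] * len(g_full), list(g_full)
    scale = dot(g_full, g_ref) / ref_sq_norm
    g_con = [scale * r for r in g_ref]
    g_par = [f - c for f, c in zip(g_full, g_con)]
    return g_con, g_par


def dloft_step(theta: Vector, g_full: Vector, g_ref: Vector, lr: float) -> Vector:
    """One plain gradient-descent update using only the paradigm-relevant part."""
    _, g_par = decouple_gradient(g_full, g_ref)
    return [p - lr * g for p, g in zip(theta, g_par)]


# Toy example: the reference gradient points along the first axis,
# so only the second component of g_full survives the decoupling.
g_full = [3.0, 4.0]
g_ref = [1.0, 0.0]
g_con, g_par = decouple_gradient(g_full, g_ref)
theta = dloft_step([0.0, 0.0], g_full, g_ref, lr=0.5)
```

The sketch applies the raw update $\theta \leftarrow \theta - \eta\, g_{\text{par}}$ from Step 3. The Experiment Setup row reports training with AdamW, so a real training loop would presumably pass the decoupled gradient to the optimizer instead; how the authors wire this up is not stated in the excerpts above.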