Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Think Only When You Need with Large Hybrid-Reasoning Models
Authors: Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experimental results and human studies conducted on Qwen-2.5 series models ranging from 1.5B to 7B parameters across multiple domains (including mathematics, programming, and general tasks) demonstrate that our LHRMs effectively perform hybrid thinking by adapting to queries of varying difficulty and types. |
| Researcher Affiliation | Collaboration | Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei; Microsoft Research, Peking University; EMAIL; EMAIL |
| Pseudocode | Yes | Algorithm 1: Hybrid Group Policy Optimization. Input: model trained at Stage I $\pi_{\theta_{\mathrm{HFT}}}$; reward models $R_\phi$; queries $P$; hyperparameters $\epsilon$, $\beta$, $\mu$ |
| Open Source Code | Yes | Our training data is constructed by combining multiple publicly available datasets, with all details provided in Appendix C. The training code will be included in the supplementary materials. |
| Open Datasets | Yes | The think-style subset includes high-quality math, code, and science questions sourced from existing datasets [35, 12, 39, 45, 50], with answers generated by DeepSeek-R1 [15] and verified for correctness. For the non-think-style subset, we collect simple queries from WildChat-1M [58] using a fastText-based classifier [21] to exclude complex reasoning tasks. |
| Dataset Splits | No | After deduplication and the removal of overlaps with evaluation benchmarks, we obtain a final set of 1.7M hybrid-formatted training examples. |
| Hardware Specification | Yes | Training the 7B model in the SFT phase takes approximately 2.5 days on 4 nodes of NVIDIA 8×H100 stations. ... The RL phase takes 2 days on NVIDIA 4×H100 stations. |
| Software Dependencies | No | We use VeRL [43] to conduct experiments. ... For implementations, we use LLaMA-Factory [60] as the codebase for both DPO and RFT. |
| Experiment Setup | Yes | All models are trained for 3 epochs with the AdamW optimizer, employing a 10% linear warmup followed by a cosine learning rate decay schedule. The maximum learning rate is set to 1e-4, with a batch size of 128 and a maximum sequence length of 32k tokens. ... By default, we use a constant 1 × 10⁻⁶ learning rate together with the AdamW optimizer for the policy model, and use a batch size of 256 and a micro batch size of 8. The rollout stage collects 256 prompts and samples 4 responses for each prompt. We set α = 1.0 and margin = 0.2 for RL training. We set the KL coefficient to 0.001 and ϵ = 0.5 in Eq. 11 in all experiments. |
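
The Experiment Setup row describes a 10% linear warmup followed by cosine decay with a maximum learning rate of 1e-4. The sketch below illustrates one common way such a schedule is computed. It is not the authors' code: the function name `lr_at_step`, the total step count of 1000, and the decay floor of 0 are assumptions made for illustration, since the excerpt does not state them. The last lines also work out the rollout count implied by 256 prompts with 4 sampled responses each.

```python
import math


def lr_at_step(step, total_steps, max_lr=1e-4, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a given optimizer step (hypothetical helper).

    Linear warmup over the first `warmup_frac` of training, then cosine
    decay from `max_lr` to `min_lr`. `min_lr=0.0` is an assumption; the
    quoted setup does not give a decay floor.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Illustrative step count only; the real value depends on dataset size,
# batch size, and epochs.
TOTAL_STEPS = 1000
schedule = [lr_at_step(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]

# RL rollout stage as quoted: 256 prompts, 4 sampled responses per prompt.
PROMPTS_PER_ROLLOUT = 256
RESPONSES_PER_PROMPT = 4
rollouts_per_step = PROMPTS_PER_ROLLOUT * RESPONSES_PER_PROMPT
```

With these assumed values, the schedule peaks at the end of warmup (step 99 of 1000) and each rollout stage produces 1024 sampled responses. The RL phase, by contrast, is described as using a constant learning rate, so no schedule applies there.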