Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Think Only When You Need with Large Hybrid-Reasoning Models

Authors: Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experimental results and human studies conducted on Qwen-2.5 series models ranging from 1.5B to 7B parameters across multiple domains (including mathematics, programming, and general tasks) demonstrate that our LHRMs effectively performs hybrid thinking by adapting to queries of varying difficulty and types.
Researcher Affiliation	Collaboration	Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei; Microsoft Research; Peking University
Pseudocode	Yes	Algorithm 1: Hybrid Group Policy Optimization. Input: model trained at Stage I, π_{θ_HFT}; reward models R_ϕ; queries P; hyperparameters ϵ, β, µ
Open Source Code	Yes	Our training data is constructed by combining multiple publicly available datasets, with all details provided in Appendix C. The training code will be included in the supplementary materials.
Open Datasets	Yes	The think-style subset includes high-quality math, code, and science questions sourced from existing datasets [35, 12, 39, 45, 50], with answers generated by DeepSeek-R1 [15] and verified for correctness. For the non-think-style subset, we collect simple queries from WildChat-1M [58] using a fastText-based classifier [21] to exclude complex reasoning tasks.
Dataset Splits	No	After deduplication and the removal of overlaps with evaluation benchmarks, we obtain a final set of 1.7M hybrid-formatted training examples.
Hardware Specification	Yes	Training the 7B model in the SFT phase takes approximately 2.5 days on 4 nodes of NVIDIA 8×H100 stations. ... The RL phase takes 2 days on NVIDIA 4×H100 stations.
Software Dependencies	No	We use VeRL [43] to conduct experiments. ... For implementations, we use LLaMA-Factory [60] as the codebase for both DPO and RFT.
Experiment Setup	Yes	All models are trained for 3 epochs with the AdamW optimizer, employing a 10% linear warmup followed by a cosine learning rate decay schedule. The maximum learning rate is set to 1e-4, with a batch size of 128 and a maximum sequence length of 32k tokens. ... By default, we use a constant 1 × 10^-6 learning rate together with AdamW optimizer for policy model, and use a batch size of 256 and micro batch size of 8. The rollout stage collects 256 prompts and samples 4 responses for each prompt. We set α = 1.0 and margin = 0.2 for RL training. We set KL coefficient to 0.001 and ϵ = 0.5 in Eq. 11 in all experiments.
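
Illustrative note on the Experiment Setup row: the SFT learning-rate schedule quoted above (10% linear warmup, then cosine decay from a peak of 1e-4) could look roughly like the sketch below. This is not the authors' code. The function name `lr_at_step`, the step count, the warmup starting point, and the final learning rate of 0 are assumptions, since the quoted text does not state them.

```python
import math


def lr_at_step(step, total_steps, max_lr=1e-4, warmup_frac=0.10, min_lr=0.0):
    """Learning rate at a given optimizer step (hypothetical helper).

    Assumptions not stated in the quoted setup: warmup ramps linearly
    from max_lr / warmup_steps up to max_lr, and the cosine phase
    decays to min_lr = 0 at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps, clamped at the end.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # assumed step count, for illustration only
    for s in (0, 50, 99, 100, 550, 1000):
        print(s, lr_at_step(s, total))
```

The RL phase quoted in the same row uses a constant learning rate of 1 × 10^-6, so no schedule is needed there.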