Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning
Authors: Xiaoxue Cheng, Junyi Li, Zhenduo Zhang, Xinyu Tang, Xin Zhao, Xinyu Kong, Zhiqiang Zhang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning. We evaluate ACPO on a range of complex reasoning benchmarks. |
| Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 4 Department of Data Science, City University of Hong Kong; 5 Ant Group |
| Pseudocode | No | The paper describes the Adaptive Cognition Policy Optimization (ACPO) framework and its two-stage training strategy in detail, along with reward design and components, but it does not include a formal pseudocode block or algorithm. |
| Open Source Code | No | We provide all the necessary code to reproduce every experiment mentioned in the paper. We will also release our code publicly upon publication. |
| Open Datasets | Yes | We conduct training on the DeepScaleR-Preview-Dataset [26], a mathematical dataset consisting of 40K question-answer pairs drawn from AIME, AMC, Omni-MATH [27] and STILL [28]. For evaluation, we assess model performance on three mathematical datasets: GSM8K [29], AIME 2024, and MATH 500 [30]. Appendix C.2 Datasets: The GSM8K [29] dataset: MIT License. The MATH [30] dataset: MIT License. The DeepScaleR-Preview-Dataset [26]: MIT License. |
| Dataset Splits | No | The paper mentions using a dataset containing 745 annotated samples for supervised fine-tuning and the DeepScaleR-Preview-Dataset for training. For evaluation, it uses GSM8K, AIME 2024, and MATH 500. However, it does not explicitly provide specific training/test/validation splits (e.g., percentages or exact counts) for any of these datasets. |
| Hardware Specification | Yes | All the experiments are conducted on 16 NVIDIA A100 GPUs. |
| Software Dependencies | No | The models are trained for one epoch, and both the SFT and RL stages are conducted using the VeRL framework [33]. We use Qwen2.5 [38] tokenizer to calculate the number of tokens in the responses generated by each model for a fair comparison. |
| Experiment Setup | Yes | In the cold start phase, we fine-tune the models for 3 epochs using 745 annotated samples with explicit reasoning tokens. For ACPO training, we adopt the same hyperparameter settings as used in DeepScaleR-1.5B-Preview. Specifically, we use a learning rate of $1 \times 10^{-6}$, a batch size of 128, and a maximum context length of 8K tokens during training. The models are trained for one epoch, and both the SFT and RL stages are conducted using the VeRL framework [33]. We set $p_{\text{thresh}} = 0.5$ in Eq. 6 and set the reward weights as $w_{\text{acc}} = 0.6$, $w_{\text{len}} = 0.3$, and $w_{\text{think}} = 0.1$ in Eq. 7. |
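
The Experiment Setup row can be collected into a small configuration object for anyone attempting a reproduction. The sketch below is illustrative only: the class name `ReportedSetup`, the field names, and the function `combined_reward` are hypothetical and do not come from the paper or from any training framework. Three points are assumptions: the learning rate is read as 1e-6 from the extracted text, "8K tokens" is read as 8192, and Eq. 7 is assumed to be a plain weighted sum of three reward terms, since this page does not reproduce the equation itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted in the Experiment Setup row (field names are hypothetical)."""
    sft_epochs: int = 3             # cold-start fine-tuning epochs
    sft_samples: int = 745          # annotated samples with explicit reasoning tokens
    rl_epochs: int = 1              # "trained for one epoch"
    learning_rate: float = 1e-6     # assumption: garbled "1 10 6" read as 1e-6
    batch_size: int = 128
    max_context_tokens: int = 8192  # assumption: "8K" read as 8192
    p_thresh: float = 0.5           # threshold used in Eq. 6
    w_acc: float = 0.6              # reward weights used in Eq. 7
    w_len: float = 0.3
    w_think: float = 0.1


def combined_reward(r_acc: float, r_len: float, r_think: float,
                    cfg: ReportedSetup = ReportedSetup()) -> float:
    """Assumed form of Eq. 7: a weighted sum of the three reward terms."""
    return cfg.w_acc * r_acc + cfg.w_len * r_len + cfg.w_think * r_think


if __name__ == "__main__":
    cfg = ReportedSetup()
    # The three reported weights add up to one (up to floating-point rounding).
    print(round(cfg.w_acc + cfg.w_len + cfg.w_think, 6))
    # A correct answer that earns no length or thinking reward.
    print(combined_reward(1.0, 0.0, 0.0))
```

Under the weighted-sum assumption, accuracy carries 60% of the total reward, so a correct but long answer would still outscore a short incorrect one. The actual definitions of the individual reward terms are given in the paper and are not modeled here.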