Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning

Authors: Xiaoxue Cheng, Junyi Li, Zhenduo Zhang, Xinyu Tang, Xin Zhao, Xinyu Kong, Zhiqiang Zhang

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning. We evaluate ACPO on a range of complex reasoning benchmarks.
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 4 Department of Data Science, City University of Hong Kong; 5 Ant Group
Pseudocode | No | The paper describes the Adaptive Cognition Policy Optimization (ACPO) framework and its two-stage training strategy in detail, along with reward design and components, but it does not include a formal pseudocode block or algorithm.
Open Source Code | No | We provide all the necessary code to reproduce every experiment mentioned in the paper. We will also release our code publicly upon publication.
Open Datasets | Yes | We conduct training on the DeepScaleR-Preview-Dataset [26], a mathematical dataset consisting of 40K question-answer pairs drawn from AIME, AMC, Omni-MATH [27] and STILL [28]. For evaluation, we assess model performance on three mathematical datasets: GSM8K [29], AIME 2024, and MATH 500 [30]. Appendix C.2 Datasets: The GSM8K [29] dataset: MIT License. The MATH [30] dataset: MIT License. The DeepScaleR-Preview-Dataset [26]: MIT License.
Dataset Splits | No | The paper mentions using a dataset containing 745 annotated samples for supervised fine-tuning and the DeepScaleR-Preview-Dataset for training. For evaluation, it uses GSM8K, AIME 2024, and MATH 500. However, it does not explicitly provide specific training/test/validation splits (e.g., percentages or exact counts) for any of these datasets.
Hardware Specification | Yes | All the experiments are conducted on 16 NVIDIA A100 GPUs.
Software Dependencies | No | The models are trained for one epoch, and both the SFT and RL stages are conducted using the VeRL framework [33]. We use Qwen2.5 [38] tokenizer to calculate the number of tokens in the responses generated by each model for a fair comparison.
Experiment Setup | Yes | In the cold start phase, we fine-tune the models for 3 epochs using 745 annotated samples with explicit reasoning tokens. For ACPO training, we adopt the same hyperparameter settings as used in DeepScaleR-1.5B-Preview. Specifically, we use a learning rate of 1 × 10⁻⁶, a batch size of 128, and a maximum context length of 8K tokens during training. The models are trained for one epoch, and both the SFT and RL stages are conducted using the VeRL framework [33]. We set p_thresh = 0.5 in Eq. 6 and set the reward weights as w_acc = 0.6, w_len = 0.3, and w_think = 0.1 in Eq. 7.
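
The reported setup can be collected into a small configuration object, which makes the quoted values easy to check at a glance. The sketch below is illustrative only: the class name, field names, and the `combined_reward` helper are hypothetical, the learning rate is read as 1e-6, "8K" is read as 8192 tokens, and Eq. 7 is assumed to be a plain weighted sum of three component rewards, since the equation itself is not reproduced on this page. The role of the Eq. 6 threshold is not modeled; it is only stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ACPOSetup:
    """Values quoted in the Experiment Setup row; all field names are hypothetical."""
    sft_epochs: int = 3                 # cold-start fine-tuning epochs
    sft_samples: int = 745              # annotated samples with explicit reasoning tokens
    rl_epochs: int = 1                  # "trained for one epoch"
    learning_rate: float = 1e-6         # assumption: the garbled value reads 1 x 10^-6
    batch_size: int = 128
    max_context_tokens: int = 8 * 1024  # assumption: "8K" means 8192
    p_thresh: float = 0.5               # threshold in Eq. 6 (stored, not used here)
    w_acc: float = 0.6                  # reward weights in Eq. 7
    w_len: float = 0.3
    w_think: float = 0.1


def combined_reward(r_acc: float, r_len: float, r_think: float,
                    cfg: ACPOSetup = ACPOSetup()) -> float:
    """Assumed form of Eq. 7: a linear combination of the three component rewards."""
    return cfg.w_acc * r_acc + cfg.w_len * r_len + cfg.w_think * r_think


setup = ACPOSetup()
# The three reported weights sum to 1 (up to floating-point rounding).
weight_sum = setup.w_acc + setup.w_len + setup.w_think
```

Under this assumed form, a response with a full accuracy reward, half the length reward, and no thinking reward would score 0.6 + 0.15 = 0.75; the actual component definitions should be taken from the paper.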