Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Thinker: Learning to Think Fast and Slow

Authors: Stephen Chung, Wenyu Du, Jie Fu

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	This section details the experiments conducted to assess whether the Thinker task can more effectively enhance LLM reasoning capabilities compared to a standard QA task. We focus on the mathematical reasoning domain here. Experimental results validate our approach: relative to the QA task, the Thinker task yields consistent gains across diverse math benchmarks, with average relative performance gains of 6.7% for Qwen2.5-1.5B models and 11.1% for DeepSeek-R1-Distill-Qwen-1.5B models.
Researcher Affiliation	Collaboration	Stephen Chung (University of Cambridge); Wenyu Du (The University of Hong Kong); Jie Fu (Shanghai AI Lab)
Pseudocode	Yes	The Thinker task decomposes the response into a four-step process: 1. Fast Thinking: The agent generates an initial answer using a small token budget. 2. Verification: The agent evaluates the correctness of the initial answer using a small token budget. If verified, it is accepted as the final answer. 3. Slow Thinking: If the initial answer fails verification, the agent can produce another final answer, using a large token budget. 4. Summarization: The agent summarizes the reasoning from the slow thinking step into a concise summary that leads to the same final answer.
Open Source Code	Yes	Additionally, we have open-sourced both the trained models and the source code. Our implementation, adapted from the Open-Reasoner-Zero codebase [3], is publicly available at https://github.com/stephen-chung-mh/thinker-task, which also includes the trained models.
Open Datasets	Yes	For the training data, we utilized the 129K math question-answering dataset provided by Open-Reasoner-Zero.
Dataset Splits	No	For the training data, we utilized the 129K math question-answering dataset provided by Open-Reasoner-Zero. The performance of the final models is detailed in Table 1. These models correspond to the checkpoints, saved at 50-step intervals, that achieved the highest accuracy on a validation dataset. (The specific split percentages or counts for the training data into train/validation are not provided.)
Hardware Specification	Yes	Each training run for both the Thinker task and the baseline required approximately 7 days on two compute nodes, each equipped with 8 A100 GPUs.
Software Dependencies	No	We use the Deepspeed [27], vLLM [28], and Ray [29] library for distributed training. (No specific version numbers are given for these libraries.)
Experiment Setup	Yes	Key hyperparameters include a discount rate γ = 1, GAE lambda λ = 1, and a sampling temperature of 1. No KL penalty against a reference policy was applied. The Fast Thinking and Summarization stages use a 1000-token budget, Verification uses a 2000-token budget, and Slow Thinking uses a 6000-token budget. Most other hyperparameters and training details mirror those in Open-Reasoner-Zero, with details provided in Appendix A.
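
Illustrative sketch (not taken from the paper or its repository): the four-step Thinker process quoted in the Pseudocode row, combined with the token budgets quoted in the Experiment Setup row, could be wired together roughly as follows. The function names run_thinker, generate, extract_answer, and is_verified, the ThinkerResult type, and the prompt wording are all hypothetical placeholders; only the step order and the budgets come from the rows above. The authors' actual implementation is at the URL in the Open Source Code row and may differ substantially.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Token budgets as quoted in the Experiment Setup row.
BUDGETS = {"fast": 1000, "verify": 2000, "slow": 6000, "summary": 1000}

# Hypothetical model interface: (prompt, max_tokens) -> generated text.
Generate = Callable[[str, int], str]


@dataclass
class ThinkerResult:
    final_answer: str
    used_slow_thinking: bool
    summary: Optional[str] = None


def run_thinker(
    question: str,
    generate: Generate,
    extract_answer: Callable[[str], str],
    is_verified: Callable[[str], bool],
) -> ThinkerResult:
    """Run one question through the four steps (assumed control flow)."""
    # 1. Fast Thinking: initial answer under a small token budget.
    fast = generate(f"Q: {question}\nAnswer quickly.", BUDGETS["fast"])
    fast_answer = extract_answer(fast)

    # 2. Verification: the model judges its own initial answer.
    verdict = generate(
        f"Q: {question}\nProposed answer: {fast_answer}\n"
        "Is this answer correct?",
        BUDGETS["verify"],
    )
    if is_verified(verdict):
        # Verified answers are accepted as final; later steps are skipped.
        return ThinkerResult(fast_answer, used_slow_thinking=False)

    # 3. Slow Thinking: a second attempt under a large token budget.
    slow = generate(
        f"Q: {question}\nThe quick answer {fast_answer} failed "
        "verification. Reason carefully.",
        BUDGETS["slow"],
    )
    slow_answer = extract_answer(slow)

    # 4. Summarization: condense the slow reasoning into a short summary.
    summary = generate(
        f"Q: {question}\nReasoning: {slow}\nSummarize concisely.",
        BUDGETS["summary"],
    )
    return ThinkerResult(slow_answer, used_slow_thinking=True, summary=summary)
```

Passing the model call in as a plain callable keeps the sketch independent of any particular inference library; in practice generate would wrap whatever sampling backend is in use and enforce the max_tokens limit there.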