Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
When Can Model-Free Reinforcement Learning be Enough for Thinking?
Authors: Josiah Hanna, Nicholas Corrado
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To test our hypothesis, we take different pre-trained LLMs and apply them to add series of five four-digit numbers... Table 1 shows the average accuracy for each model. For all models, we see that appending the thinking tokens increases accuracy... In Figure 2 we see that Pretrained-Think learns significantly faster than the other agents. |
| Researcher Affiliation | Academia | Josiah P. Hanna Computer Sciences Department University of Wisconsin Madison EMAIL Nicholas E. Corrado Computer Sciences Department University of Wisconsin Madison EMAIL |
| Pseudocode | No | The paper describes algorithms like policy iteration and REINFORCE but does not provide them in pseudocode blocks or algorithms. There are no figures or sections explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Code for these experiments is available at https://github.com/prediction-action-lab/thinking-as-control. We have included the code for the submission in the supplemental material. |
| Open Datasets | No | No external data sources are used. |
| Dataset Splits | No | For the LLM experiment, the paper describes constructing prompts (e.g., "We construct 1000 'Thinking' and 1000 'No Thinking' prompts like these using 1000 different sequences of four 4-digit integers") rather than using predefined dataset splits. For the Gridworld experiment, it mentions "We run 20 trials for each learning agent", but this refers to experimental runs, not dataset splits. |
| Hardware Specification | Yes | All experiments are run on a MacBook Air with an Apple M1 chip and 16GB of memory. |
| Software Dependencies | Yes | We implement the Gridworld domain, pre-training, and reinforcement learning set-up in Python, using PyTorch [Paszke et al., 2019] for neural networks and gradient optimization. All experiments are run on a MacBook Air with an Apple M1 chip and 16GB of memory. For both model pre-training and RL with REINFORCE we use the Adam optimizer [Kingma and Ba, 2015] with learning rates 1e-4 and 1e-5, respectively. |
| Experiment Setup | Yes | For both model pre-training and RL with REINFORCE we use the Adam optimizer [Kingma and Ba, 2015] with learning rates 1e-4 and 1e-5, respectively. For REINFORCE, we do not use a value function baseline as we found it generally did not help because the sparseness of the reward led to poor value estimates that harmed early policy learning. For the "No Think" methods, we mask out the special actions by adding a large negative value to the logits for those actions before passing them to the softmax distribution. The horizontal axis is the iteration of policy improvement (the agent collects 200 episodes at each iteration). |
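
The action masking described in the Experiment Setup row can be illustrated with a minimal sketch. This is not the authors' code (their implementation uses PyTorch and lives in the linked repository); it is a standard-library illustration. The penalty constant `-1e9`, the function name `masked_softmax`, and the six-action layout are assumptions made for the example, since the quoted text only says "a large negative value".

```python
import math


def masked_softmax(logits, masked_indices, penalty=-1e9):
    """Softmax over `logits` after adding `penalty` to the masked entries.

    `penalty` stands in for the "large negative value" mentioned in the
    setup description; the exact constant is an assumption.
    """
    masked = set(masked_indices)
    shifted = [x + penalty if i in masked else x for i, x in enumerate(logits)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical layout: four environment actions followed by two special
# "thinking" actions (indices 4 and 5) that a "No Think" agent must not use.
logits = [0.5, 1.0, -0.3, 0.2, 2.0, 1.5]
probs = masked_softmax(logits, masked_indices=[4, 5])
```

With the mask applied, the two special actions receive zero probability after underflow and the remaining mass is renormalized over the four environment actions, so sampling from `probs` never selects a masked action even though their raw logits were the largest.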