Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Reasoning Models Sometimes Output Illegible Chains of Thought
Authors: Arun Jose
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we study whether outcome-based RL causes reasoning models to do meaningful reasoning in illegible CoT. We evaluate the legibility of 14 models, including DeepSeek R1 and its distills, R1-Zero, QwQ, Qwen3, Kimi K2, and various Claude models when reasoning about difficult scientific questions [Rein et al., 2023], and score their outputs on legibility using GPT-4o. |
| Researcher Affiliation | Academia | Arun Jose Independent |
| Pseudocode | No | The paper describes the experimental methodology and findings in prose, without presenting any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is provided in the supplementary material, and will be open-sourced. |
| Open Datasets | Yes | We use questions from the GPQA-Diamond dataset [Rein et al., 2023], a hard dataset of 198 multiple-choice questions in biology, physics, and chemistry |
| Dataset Splits | No | The paper uses the GPQA-Diamond dataset and classifies question hardness within it (e.g., 'Easy', 'Medium', 'Hard' as shown in Figure 4), but it does not specify explicit training, validation, or test splits for models used in its experiments. |
| Hardware Specification | No | We evaluate most models through OpenRouter, and Claude models through the Anthropic API. |
| Software Dependencies | No | We prompt GPT-4o to evaluate the CoTs for legibility on a scale of 1-9 |
| Experiment Setup | Yes | By default, we sample models with temperature 1. We use questions from the GPQA-Diamond dataset [Rein et al., 2023], a hard dataset of 198 multiple-choice questions in biology, physics, and chemistry, to construct our prompt pairs we remove the answer choices to make the questions harder. ... We prompt GPT-4o to evaluate the CoTs for legibility on a scale of 1-9, with 1 being the most legible and 9 the least. |
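
The setup quoted in the Experiment Setup row (answer choices removed from each question, then a grader model scoring each CoT from 1 to 9) can be illustrated with a minimal Python sketch. This is not the paper's code: the grader prompt wording, the option-line pattern, and every function name below are hypothetical, and the call to the grader model is left out entirely, so the sketch covers only prompt construction and score parsing.

```python
import re
from statistics import mean
from typing import List, Optional

# Hypothetical grader instructions; the paper's actual prompt wording is not
# reproduced on this page. Only the 1-9 scale (1 = most legible) is sourced.
GRADER_TEMPLATE = (
    "Rate the legibility of the following chain of thought on a scale of 1-9, "
    "where 1 is the most legible and 9 the least. "
    "Reply with a single integer.\n\n"
    "Chain of thought:\n{cot}"
)

# Assumed layout: each answer choice sits on its own line, e.g. "(A) ..." or "B. ...".
OPTION_LINE = re.compile(r"^\s*\(?[A-D][\).:]\s")


def strip_answer_choices(question: str) -> str:
    """Drop lines that look like multiple-choice options."""
    kept = [line for line in question.splitlines() if not OPTION_LINE.match(line)]
    return "\n".join(kept).strip()


def build_grader_prompt(cot: str) -> str:
    """Fill the (hypothetical) grader template with one chain of thought."""
    return GRADER_TEMPLATE.format(cot=cot)


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit 1-9 in the grader reply, else None."""
    match = re.search(r"\b([1-9])\b", reply)
    return int(match.group(1)) if match else None


def mean_legibility(replies: List[str]) -> Optional[float]:
    """Average the parseable scores; unparseable replies are skipped."""
    scores = [s for s in (parse_score(r) for r in replies) if s is not None]
    return mean(scores) if scores else None
```

In use, each model's CoT would be passed through `build_grader_prompt`, sent to the grader model by whatever client is available, and the replies collected for `mean_legibility`. Skipping unparseable replies rather than defaulting them to a score is a design choice of this sketch, not something the source specifies.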