Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Reasoning Models Sometimes Output Illegible Chains of Thought

Authors: Arun Jose

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we study whether outcome-based RL causes reasoning models to do meaningful reasoning in illegible CoT. We evaluate the legibility of 14 models, including DeepSeek R1 and its distills, R1-Zero, QwQ, Qwen3, Kimi K2, and various Claude models when reasoning about difficult scientific questions [Rein et al., 2023], and score their outputs on legibility using GPT-4o.
Researcher Affiliation | Academia | Arun Jose, Independent
Pseudocode | No | The paper describes the experimental methodology and findings in prose, without presenting any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is provided in the supplementary material, and will be open-sourced.
Open Datasets | Yes | We use questions from the GPQA-Diamond dataset [Rein et al., 2023], a hard dataset of 198 multiple-choice questions in biology, physics, and chemistry
Dataset Splits | No | The paper uses the GPQA-Diamond dataset and classifies question hardness within it (e.g., 'Easy', 'Medium', 'Hard' as shown in Figure 4), but it does not specify explicit training, validation, or test splits for models used in its experiments.
Hardware Specification | No | We evaluate most models through OpenRouter, and Claude models through the Anthropic API.
Software Dependencies | No | We prompt GPT-4o to evaluate the CoTs for legibility on a scale of 1-9
Experiment Setup | Yes | By default, we sample models with temperature 1. We use questions from the GPQA-Diamond dataset [Rein et al., 2023], a hard dataset of 198 multiple-choice questions in biology, physics, and chemistry, to construct our prompt pairs we remove the answer choices to make the questions harder. ... We prompt GPT-4o to evaluate the CoTs for legibility on a scale of 1-9, with 1 being the most legible and 9 the least.
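
The experiment setup quoted above (questions with answer choices removed, a grader scoring each chain of thought from 1 to 9) can be illustrated with a minimal Python sketch. This is not the paper's code: the record field name `question`, the grader prompt wording, and all function names are assumptions made for illustration, and no model API is called.

```python
import re
from statistics import mean
from typing import Optional

# Hypothetical record layout: the real GPQA-Diamond field names may differ.
def build_open_ended_prompt(record: dict) -> str:
    """Return only the question text, dropping the multiple-choice options."""
    return record["question"].strip()

def build_grader_prompt(chain_of_thought: str) -> str:
    """Illustrative grading instruction; not the wording used in the paper."""
    return (
        "Rate the legibility of the following reasoning on a scale of 1-9, "
        "where 1 is the most legible and 9 the least. "
        "Reply with a single integer.\n\n" + chain_of_thought
    )

def parse_legibility_score(reply: str) -> Optional[int]:
    """Extract the first standalone digit in 1-9 from a grader reply."""
    match = re.search(r"(?<!\d)([1-9])(?!\d)", reply)
    return int(match.group(1)) if match else None

def mean_legibility(replies: list) -> Optional[float]:
    """Average the parseable scores; return None if no reply parses."""
    scores = [s for s in map(parse_legibility_score, replies) if s is not None]
    return mean(scores) if scores else None
```

Replies that contain no standalone digit between 1 and 9 (for example "10" or free text) are skipped rather than guessed, so a malformed grader output does not silently shift the average.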