Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Reasoning Models Sometimes Output Illegible Chains of Thought

Authors: Arun Jose

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we study whether outcome-based RL causes reasoning models to do meaningful reasoning in illegible CoT. We evaluate the legibility of 14 models, including DeepSeek R1 and its distills, R1-Zero, QwQ, Qwen3, Kimi K2, and various Claude models when reasoning about difficult scientific questions [Rein et al., 2023], and score their outputs on legibility using GPT-4o.
Researcher Affiliation	Academia	Arun Jose Independent
Pseudocode	No	The paper describes the experimental methodology and findings in prose, without presenting any structured pseudocode or algorithm blocks.
Open Source Code	Yes	Code is provided in the supplementary material, and will be open-sourced.
Open Datasets	Yes	We use questions from the GPQA-Diamond dataset [Rein et al., 2023], a hard dataset of 198 multiple-choice questions in biology, physics, and chemistry
Dataset Splits	No	The paper uses the GPQA-Diamond dataset and classifies question hardness within it (e.g., 'Easy', 'Medium', 'Hard' as shown in Figure 4), but it does not specify explicit training, validation, or test splits for models used in its experiments.
Hardware Specification	No	We evaluate most models through OpenRouter, and Claude models through the Anthropic API.
Software Dependencies	No	We prompt GPT-4o to evaluate the CoTs for legibility on a scale of 1-9
Experiment Setup	Yes	By default, we sample models with temperature 1. We use questions from the GPQA-Diamond dataset [Rein et al., 2023], a hard dataset of 198 multiple-choice questions in biology, physics, and chemistry, to construct our prompt pairs we remove the answer choices to make the questions harder. ... We prompt GPT-4o to evaluate the CoTs for legibility on a scale of 1-9, with 1 being the most legible and 9 the least.
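
The experiment setup quoted above (question stems without answer choices, a grader that returns a legibility score from 1 to 9) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's code: the function names, the grader prompt wording, and the score-parsing rule are all hypothetical, and no model API is called. The parser assumes the grader replies with the score before any other number.

```python
import re
from statistics import mean

# From the quoted setup: 1 is the most legible, 9 the least.
SCALE_MIN, SCALE_MAX = 1, 9


def build_question_prompt(question_stem: str) -> str:
    """Return only the question stem; answer choices are deliberately left out."""
    return question_stem.strip()


def build_grader_prompt(cot: str) -> str:
    """Hypothetical grader prompt; the paper's actual wording is not shown here."""
    return (
        "Rate the legibility of the following chain of thought on a scale of "
        f"{SCALE_MIN}-{SCALE_MAX}, where {SCALE_MIN} is the most legible and "
        f"{SCALE_MAX} is the least. Reply with a single integer.\n\n{cot}"
    )


def parse_score(reply: str):
    """Return the first standalone integer within the scale, or None if absent."""
    for match in re.finditer(r"\b(\d+)\b", reply):
        value = int(match.group(1))
        if SCALE_MIN <= value <= SCALE_MAX:
            return value
    return None


def mean_legibility(replies):
    """Average the parseable scores; unparseable grader replies are skipped."""
    scores = [s for s in (parse_score(r) for r in replies) if s is not None]
    return mean(scores) if scores else None
```

In such a sketch, each sampled chain of thought would be passed through build_grader_prompt, sent to the grader model by whatever client is in use, and the replies aggregated with mean_legibility. Skipping unparseable replies rather than guessing a score is one possible design choice; a real pipeline might instead retry the grader.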