Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Measuring the Faithfulness of Thinking Drafts in Large Reasoning Models

Authors: Zidi Xiong, Shan Chen, Zhenting Qi, Himabindu Lakkaraju

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments across six state-of-the-art LRMs. Our findings show that current LRMs demonstrate selective faithfulness to intermediate reasoning steps and frequently fail to faithfully align with the draft conclusions.
Researcher Affiliation | Academia | Zidi Xiong¹, Shan Chen¹, Zhenting Qi¹, Himabindu Lakkaraju¹. ¹Harvard University. Correspondence to: Zidi Xiong EMAIL and Himabindu Lakkaraju EMAIL
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. The methodologies are described in prose within the main text and appendices.
Open Source Code | No | Justification: We will open source data and code after acceptance.
Open Datasets | Yes | Dataset: Our experiments are conducted on the challenge reasoning dataset GPQA Diamond [20] and the factoid recall-based MMLU (global facts subset) [13, 10].
Dataset Splits | No | We use the GPQA Diamond dataset with 198 multiple-choice questions and the MMLU Redux [10] global facts subset, which includes 88 correct MMLU multiple-choice questions after filtering out factually incorrect choices.
Hardware Specification | No | Justification: We focus more on evaluation than on training, so we do not discuss the computer resources in detail.
Software Dependencies | No | We leverage GPT4O-MINI as an annotator to decompose the thinking draft into steps... We employ QWEN-2.5-INSTRUCT-32B as a classifier... All experiments use greedy decoding with temperature set to 0 to ensure maximum reproducibility.
Experiment Setup | Yes | All experiments use greedy decoding with temperature set to 0 to ensure maximum reproducibility. ... For DeepSeek-R1, we use the default nucleus sampling with temperature = 0.6 and top-p = 0.95 via the DeepSeek-R1 API. For self-generated drafts, we adopt greedy decoding with temperature = 0.
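
The Experiment Setup row contrasts two decoding regimes: greedy decoding (temperature = 0) and nucleus sampling (temperature = 0.6, top-p = 0.95). The following is a minimal, illustrative sketch of how those two settings differ when choosing a single token from raw logits. It is not the authors' code or any provider's API; the helper name `sample_token` and the toy logits are assumptions introduced here for illustration only.

```python
import math
import random


def sample_token(logits, temperature=0.0, top_p=1.0, rng=None):
    """Pick a token index from raw logits (hypothetical helper, not from the paper)."""
    if temperature == 0.0:
        # Greedy decoding: always take the highest-scoring token (deterministic).
        return max(range(len(logits)), key=lambda i: logits[i])

    rng = rng or random.Random()
    # Temperature-scaled softmax (subtract the max for numerical stability).
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus (top-p) filtering: keep the smallest set of most likely tokens
    # whose cumulative probability reaches top_p.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Sample within the nucleus; random.choices renormalizes the weights.
    return rng.choices(nucleus, weights=[probs[i] for i in nucleus], k=1)[0]


# Toy logits (assumed values, for illustration only).
toy_logits = [1.0, 3.0, 2.0]
greedy_choice = sample_token(toy_logits)  # temperature = 0
sampled_choice = sample_token(toy_logits, temperature=0.6, top_p=0.95,
                              rng=random.Random(0))
```

With temperature = 0 the same logits always yield the same token, which is why the greedy setting supports reproducibility; with temperature = 0.6 and top-p = 0.95 the choice can vary between runs unless the random seed is fixed.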