Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Chain-of-Thought Unfaithfulness as Disguised Accuracy

Authors: Oliver Bentham, Nathan Stringham, Ana Marasović

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report.
Researcher Affiliation | Academia | Oliver Bentham (EMAIL), Kahlert School of Computing, University of Utah; Nathan Stringham (EMAIL), Kahlert School of Computing, University of Utah; Ana Marasović (EMAIL), Kahlert School of Computing, University of Utah
Pseudocode | No | The paper provides mathematical formulas for Unfaithfulness_Lanham(M, D), N(M, D), and Unfaithfulness_Normalized(M, D) in Equations (1), (2), and (3) respectively. However, it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code at: https://github.com/utahnlp/cot_disguised_accuracy
Open Datasets | Yes | Multiple Choice Benchmarks: The multiple choice task is a question answering task where each example is presented as a multiple choice question (MCQ), consisting of a question and a set of candidate answers, of which only one is correct. Following the original paper, we evaluate our models on the following MCQ datasets: AQuA-RAT (Ling et al., 2017), ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2023), MMLU (Hendrycks et al., 2021), OpenBookQA (Mihaylov et al., 2018), and TruthfulQA (Lin et al., 2022). Table 2: Evaluation setups used for the multiple choice benchmarks. We include the number of examples used and a link to each data source.
Dataset Splits | Yes | As shown in Table 2, we do not always evaluate the entire test sets due to resource constraints. Instead, we either use the full dataset or sample 500 examples, whichever is less.
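The split-selection rule quoted above (full test set, or a 500-example sample, whichever is smaller) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name, cap, and seed handling are assumptions:

```python
import random

def build_eval_split(dataset, cap=500, seed=0):
    """Return the full dataset if it has <= cap examples,
    otherwise a fixed-seed sample of cap examples without replacement."""
    examples = list(dataset)
    if len(examples) <= cap:
        return examples
    return random.Random(seed).sample(examples, cap)
```

A fixed seed keeps the subsample reproducible across runs, which matters when the score depends on which 500 examples were drawn.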
Hardware Specification | Yes | Due to resource constraints, we are not able to fit Llama 2 70b or FLAN-UL2 on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using quantized models and refers to the Hugging Face platform for model access (Table 1), but it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch, Transformers library, CUDA).
Experiment Setup | Yes | In our experiments, we use the same prompting methods described in Lanham et al. (2023). Namely, we decode using nucleus sampling with p = 0.95 and temperature 0.8.
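For readers unfamiliar with the decoding setup cited above, nucleus (top-p) sampling with p = 0.95 and temperature 0.8 can be sketched in plain Python. This is a hypothetical reference implementation for illustration only; actual runs would use a library's built-in sampler:

```python
import math
import random

def nucleus_sample(logits, top_p=0.95, temperature=0.8, rng=None):
    """Sample a token index: apply temperature, take the smallest set of
    tokens whose cumulative probability reaches top_p, renormalize, sample."""
    rng = rng or random.Random(0)
    # Temperature-scaled softmax (max-subtracted for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the highest-probability tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample proportionally within the kept (renormalized) nucleus.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a strongly peaked distribution the nucleus collapses to the single top token, so decoding becomes effectively greedy; flatter distributions leave more candidates in play.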