Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Chain-of-Thought Unfaithfulness as Disguised Accuracy

Authors: Oliver Bentham, Nathan Stringham, Ana Marasović

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report.
Researcher Affiliation | Academia | Oliver Bentham (EMAIL), Kahlert School of Computing, University of Utah; Nathan Stringham (EMAIL), Kahlert School of Computing, University of Utah; Ana Marasović (EMAIL), Kahlert School of Computing, University of Utah
Pseudocode | No | The paper provides mathematical formulas for Unfaithfulness_Lanham(M, D), N(M, D), and Unfaithfulness_Normalized(M, D) in Equations (1), (2), and (3) respectively. However, it does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release our code at: https://github.com/utahnlp/cot_disguised_accuracy
Open Datasets | Yes | Multiple Choice Benchmarks: The multiple choice task is a question answering task where each example is presented as a multiple choice question (MCQ), consisting of a question and a set of candidate answers, of which only one is correct. Following the original paper, we evaluate our models on the following MCQ datasets: AQuA-RAT (Ling et al., 2017), ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2023), MMLU (Hendrycks et al., 2021), OpenBookQA (Mihaylov et al., 2018), and TruthfulQA (Lin et al., 2022). Table 2: Evaluation setups used for the multiple choice benchmarks. We include the number of examples used and a link to each data source.
Dataset Splits | Yes | As shown in Table 2, we do not always evaluate the entire test sets due to resource constraints. Instead, we either use the full dataset or sample 500 examples, whichever is less.
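The split-selection rule quoted above (full test set, or a 500-example sample, whichever is smaller) can be sketched in a few lines. This is a minimal illustration, not the authors' code; the function name, cap, and seed handling are assumptions:

```python
import random

def build_eval_split(dataset, cap=500, seed=0):
    """Return the full dataset if it has <= cap examples,
    otherwise a fixed-seed sample of cap examples without replacement."""
    examples = list(dataset)
    if len(examples) <= cap:
        return examples
    return random.Random(seed).sample(examples, cap)
```

A fixed seed keeps the subsample reproducible across runs, which matters when the score depends on which 500 examples were drawn.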
Hardware Specification | Yes | Due to resource constraints, we are not able to fit Llama 2 70b or FLAN-UL2 on a single NVIDIA A100 GPU.
Software Dependencies | No | The paper mentions using quantized models and refers to the Hugging Face platform for model access (Table 1), but it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch, Transformers library, CUDA).
Experiment Setup | Yes | In our experiments, we use the same prompting methods described in Lanham et al. (2023). Namely, we decode using nucleus sampling with p = 0.95 and temperature 0.8.
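For readers unfamiliar with the decoding setup cited above, nucleus (top-p) sampling with p = 0.95 and temperature 0.8 can be sketched in plain Python. This is a hypothetical reference implementation for illustration only; actual runs would use a library's built-in sampler:

```python
import math
import random

def nucleus_sample(logits, top_p=0.95, temperature=0.8, rng=None):
    """Sample a token index: apply temperature, take the smallest set of
    tokens whose cumulative probability reaches top_p, renormalize, sample."""
    rng = rng or random.Random(0)
    # Temperature-scaled softmax (max-subtracted for numerical stability).
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the highest-probability tokens until their mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample proportionally within the kept (renormalized) nucleus.
    mass = sum(probs[i] for i in kept)
    r = rng.random() * mass
    acc = 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]
```

With a strongly peaked distribution the nucleus collapses to the single top token, so decoding becomes effectively greedy; flatter distributions leave more candidates in play.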