Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Chain-of-Thought Unfaithfulness as Disguised Accuracy
Authors: Oliver Bentham, Nathan Stringham, Ana Marasović
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. |
| Researcher Affiliation | Academia | Oliver Bentham (EMAIL), Kahlert School of Computing, University of Utah; Nathan Stringham (EMAIL), Kahlert School of Computing, University of Utah; Ana Marasović (EMAIL), Kahlert School of Computing, University of Utah |
| Pseudocode | No | The paper provides mathematical formulas for Unfaithfulness_Lanham(M, D), N(M, D), and Unfaithfulness_Normalized(M, D) in Equations (1), (2), and (3) respectively. However, it does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our code at: https://github.com/utahnlp/cot_disguised_accuracy |
| Open Datasets | Yes | Multiple Choice Benchmarks: The multiple choice task is a question answering task where each example is presented as a multiple choice question (MCQ), consisting of a question and a set of candidate answers, of which only one is correct. Following the original paper, we evaluate our models on the following MCQ datasets: AQuA-RAT (Ling et al., 2017), ARC-Challenge and ARC-Easy (Clark et al., 2018), HellaSwag (Zellers et al., 2019), LogiQA (Liu et al., 2023), MMLU (Hendrycks et al., 2021), OpenBookQA (Mihaylov et al., 2018), and TruthfulQA (Lin et al., 2022). Table 2: Evaluation setups used for the multiple choice benchmarks. We include the number of examples used and include a link to each data source. |
| Dataset Splits | Yes | As shown in Table 2, we do not always evaluate the entire test sets due to resource constraints. Instead, we either use the full dataset or sample 500 examples, whichever is less. |
| Hardware Specification | Yes | Due to resource constraints, we are not able to fit Llama 2 70b or FLAN-UL2 on a single NVIDIA A100 GPU. |
| Software Dependencies | No | The paper mentions using quantized models and refers to the Hugging Face platform for model access (Table 1), but it does not specify version numbers for any software libraries or dependencies (e.g., PyTorch, Transformers library, CUDA). |
| Experiment Setup | Yes | In our experiments, we use the same prompting methods described in Lanham et al. (2023). Namely, we decode using nucleus sampling with p = 0.95 and temperature 0.8. |
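The Experiment Setup row reports decoding with nucleus sampling at p = 0.95 and temperature 0.8. As a rough illustration of what that decoding rule does (a self-contained sketch over a raw logit vector, not the authors' code, which runs Hugging Face models), the two parameters act in sequence: temperature rescales the logits before the softmax, and top-p then keeps only the smallest set of tokens whose cumulative probability reaches p before sampling.

```python
import math
import random

def nucleus_sample(logits, p=0.95, temperature=0.8, rng=None):
    """Sample a token index from raw logits via nucleus (top-p) sampling.

    Temperature is applied first; then the smallest set of tokens whose
    cumulative probability reaches `p` is kept, renormalized, and sampled.
    """
    rng = rng or random.Random()
    # Temperature-scaled softmax (shift by max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Token indices ordered by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the smallest prefix whose cumulative mass reaches p.
    nucleus, cum = [], 0.0
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= p:
            break
    # Renormalize within the nucleus and draw one index.
    mass = sum(probs[i] for i in nucleus)
    r = rng.random() * mass
    acc = 0.0
    for i in nucleus:
        acc += probs[i]
        if r <= acc:
            return i
    return nucleus[-1]
```

With a very peaked distribution and a small p, the nucleus collapses to the single most likely token, so sampling becomes deterministic; with flatter logits, the temperature of 0.8 mildly sharpens the distribution while p = 0.95 trims only the long tail.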