Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

Authors: Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, Kyunghyun Cho

TMLR 2024

Each row below lists a reproducibility variable, its classification result, and the supporting LLM response:
Research Type: Experimental. We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks. [...] We show empirically that a wide range of pre-trained language models demonstrate low consistency rates on both hypothetical transformations (Section 3) and compositional transformations (Section 4).
Researcher Affiliation: Collaboration. Angelica Chen (EMAIL), Center for Data Science, New York University; Alicia Parrish (EMAIL), Department of Linguistics, New York University, and Google; Samuel R. Bowman (EMAIL), Center for Data Science, New York University, and Anthropic.
Pseudocode: No. No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found. The paper defines formal concepts and functions but does not present structured pseudocode for its methodology.
Open Source Code: No. The paper does not contain any explicit statement about making the source code for their methodology publicly available, nor does it provide a link to a code repository. It only mentions access to OpenAI models via an API program in the acknowledgements: 'Open AI for providing access to and credits for their models via the API Academic Access Program.'
Open Datasets: Yes. Wikipedia: Since language models are frequently pre-trained on Wikipedia archives, evaluating on a Wikipedia dataset can confound information memorized during pre-training with the skill being evaluated. To address this issue, we collect a sample of 400 English Wikipedia (Wikimedia Foundation) articles that were created on or after June 30, 2021 [...] Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org. [...] Daily Dialog: Daily Dialog (Li et al., 2017) is a manually labeled dataset of multi-turn conversations about daily life. [...] Geo Query: Geo Query (Zelle & Mooney, 1996) is a semantic parsing dataset consisting of 880 natural language questions about US geography [...]
Dataset Splits: Yes. We collect a sample of 400 English Wikipedia (Wikimedia Foundation) articles that were created on or after June 30, 2021 [...] Each initial prompt is a randomly selected segment of a Wikipedia article [...] We randomly sample 400 examples from the training split and use the first conversational turn as the initial prompt. [...] We generate a set of 400 randomly-nested arithmetic expressions as the initial prompts. [...] a sample of 400 Geo Query training examples
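The arithmetic portion of this setup (400 randomly-nested expressions used as initial prompts) can be sketched as follows. This is a minimal illustration, not the paper's generator: the operator set, operand range, and nesting policy are assumptions, since the paper does not specify them.

```python
import random

def nested_expr(depth: int, rng: random.Random) -> str:
    """Recursively build a parenthesized arithmetic expression.

    At depth 0 the expression is a single digit; otherwise it is a
    binary operation whose operands are themselves nested expressions
    of strictly smaller depth.
    """
    if depth == 0:
        return str(rng.randint(0, 9))
    op = rng.choice(["+", "-", "*"])
    left = nested_expr(rng.randint(0, depth - 1), rng)
    right = nested_expr(rng.randint(0, depth - 1), rng)
    return f"({left} {op} {right})"

# Generate 400 initial prompts, mirroring the sample size above.
rng = random.Random(0)
prompts = [nested_expr(depth=3, rng=rng) for _ in range(400)]
```

Seeding the generator makes the prompt set reproducible across runs, which matters when the same expressions must later be decomposed into sub-expressions.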
Hardware Specification: No. The paper states, 'All experiments are run using greedily decoded completions obtained from the Open AI API from Aug. 2022 to Jun. 2023.' and 'We also thank the NYU High-Performance Computing Center for in-kind support'. However, it does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies: No. The paper mentions 'All experiments are run using greedily decoded completions obtained from the Open AI API from Aug. 2022 to Jun. 2023.' This indicates the use of the OpenAI API but does not specify version numbers for the API itself or for any other software libraries or frameworks used in the implementation.
Experiment Setup: Yes. All experiments are run using greedily decoded completions obtained from the Open AI API from Aug. 2022 to Jun. 2023. We use 0-shot initial prompts but evaluate hypothetical consistency prompts using k-shot prompts, where k ranges from 1 to 10. Since the in-context performance of LLMs is known to vary depending on the selection of in-context examples (Liu et al., 2021; Rubin et al., 2022), we randomly select a different set of in-context examples for each prompt. We also randomize the order of answer choices for each multiple-choice question to mitigate sensitivity to answer choice order (Pezeshkpour & Hruschka, 2023). We also vary the number of words m in the original completion that the model is asked to distinguish from 1 to 6 [...] We then collect model completions for all possible sub-expressions of each expression using k-shot prompts, with k ranging from 3 to 10. We randomly select the in-context examples for each prompt.
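The prompt-construction steps described in this setup (freshly sampled in-context examples per prompt, plus per-question shuffling of answer choices) can be sketched as below. The function name, demonstration format, and option labels are assumptions for illustration; the paper does not publish its prompt templates.

```python
import random

def build_kshot_prompt(pool, question, choices, k, rng):
    """Assemble a k-shot multiple-choice prompt.

    `pool` is a list of (question, answer) demonstration pairs. A fresh
    set of k demonstrations is drawn for every prompt, and the answer
    choices are shuffled per question, mirroring the two randomizations
    described in the setup above.
    """
    demos = rng.sample(pool, k)                   # new in-context examples each time
    shuffled = rng.sample(choices, len(choices))  # randomize answer-choice order
    lines = [f"Q: {q}\nA: {a}" for q, a in demos]
    options = "\n".join(f"({chr(65 + i)}) {c}" for i, c in enumerate(shuffled))
    lines.append(f"Q: {question}\n{options}\nA:")
    return "\n\n".join(lines)

# Example usage with a toy demonstration pool.
rng = random.Random(1)
pool = [(f"example question {i}", f"answer {i}") for i in range(20)]
prompt = build_kshot_prompt(
    pool, "Which animal is largest?", ["cat", "whale", "mouse"], k=4, rng=rng
)
```

Drawing demonstrations and shuffling choices with an explicit `random.Random` instance keeps each prompt's randomization reproducible while still varying it across prompts.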