Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Authors: Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of Llama-2 7B applied to multistep reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. Our findings supply empirical answers to a pertinent open question about whether LLMs actually rely on CoT to answer questions (Tan, 2023; Lampinen et al., 2022). |
| Researcher Affiliation | Academia | Subhabrata Dutta (IIT Delhi, India); Joykirat Singh (Independent); Soumen Chakrabarti (IIT Bombay, India); Tanmoy Chakraborty (IIT Delhi, India) |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and data are made available at https://github.com/joykirat18/How-To-Think-Step-by-Step. |
| Open Datasets | Yes | To minimize the effects of MLP blocks and focus primarily on reasoning from the provided context, we make use of the PrOntoQA dataset (Saparov & He, 2023) that employs ontology-based question answering using fictional entities (see Figure 1 for an example). |
| Dataset Splits | Yes | Total training pairs: 28,392; total testing pairs: 9,204. All three types of pairs (positively related, negatively related, and unrelated) are present in equal proportion in the training and testing data. |
| Hardware Specification | No | The paper mentions using 'Llama-2 7B' as the model, but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Llama-2 7B' but does not specify any other software dependencies or their version numbers, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We use 6-shot examples of CoT for generation in all the experiments. ... 4-layer MLP model: 4096 * 2 -> 128 -> 64 -> 32 -> 3, with ReLU between each linear layer. Learning rate: 0.00005; number of epochs: 120. |
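The probe architecture quoted in the Experiment Setup row can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the layer widths (4096 * 2 input, i.e. a pair of concatenated Llama-2 7B hidden states, down to 3 output classes) and the ReLU placement come from the table, while the use of NumPy, the weight initialization, and the `forward` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths as reported: 4096 * 2 -> 128 -> 64 -> 32 -> 3.
sizes = [4096 * 2, 128, 64, 32, 3]

# He-style initialization (an assumption; the paper does not specify one).
weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: ReLU between each linear layer, none after the last."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

# A batch of 4 concatenated hidden-state pairs produces 4 x 3 class logits.
logits = forward(rng.standard_normal((4, sizes[0])))
print(logits.shape)  # (4, 3)
```

Training details from the table (learning rate 0.00005, 120 epochs) would apply on top of this forward pass; the optimizer is not specified in the extracted text.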