Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Authors: Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of Llama-2 7B applied to multistep reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. Our findings supply empirical answers to a pertinent open question about whether LLMs actually rely on CoT to answer questions (Tan, 2023; Lampinen et al., 2022). |
| Researcher Affiliation | Academia | Subhabrata Dutta EMAIL IIT Delhi, India Joykirat Singh EMAIL Independent Soumen Chakrabarti EMAIL IIT Bombay, India Tanmoy Chakraborty EMAIL IIT Delhi, India |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and data are made available at https://github.com/joykirat18/How-To-Think-Step-by-Step. |
| Open Datasets | Yes | To minimize the effects of MLP blocks and focus primarily on reasoning from the provided context, we make use of the PrOntoQA dataset (Saparov & He, 2023) that employs ontology-based question answering using fictional entities (see Figure 1 for an example). |
| Dataset Splits | Yes | Total training pairs: 28392; Total testing pairs: 9204. All three types of pairs (positively and negatively related and unrelated) are present in equal proportion in the training and testing data. |
| Hardware Specification | No | The paper mentions using 'Llama-2 7B' as the model, but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Llama-2 7B' but does not specify any other software dependencies or their version numbers, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We use 6-shot examples of CoT for generation in all the experiments. ... 4-layer MLP model, 4096 * 2 -> 128 -> 64 -> 32 -> 3. With ReLU in between each Linear layer. Learning rate: 0.00005 Number of epochs: 120. |
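
The layer widths quoted in the Experiment Setup row can be turned into a small shape check. The following is a minimal sketch in plain Python, not the authors' implementation: it assumes each Linear layer has a bias term, uses a hypothetical uniform initialization, and runs only a forward pass on a random input. The learning rate and epoch count are copied from the table but are not used, since the excerpt does not name the optimizer, the loss, or the framework. The function names `count_parameters`, `init_layers`, and `forward` are illustrative.

```python
import random

# Layer widths quoted in the Experiment Setup row: 4096 * 2 -> 128 -> 64 -> 32 -> 3.
LAYER_SIZES = [4096 * 2, 128, 64, 32, 3]
LEARNING_RATE = 0.00005  # quoted in the table; unused in this forward-only sketch
NUM_EPOCHS = 120         # quoted in the table; unused in this forward-only sketch


def count_parameters(sizes):
    """Weights plus biases for a stack of fully connected layers (biases assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


def init_layers(sizes, rng):
    """Hypothetical small random initialization; the source does not state one."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.01, 0.01) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Apply each Linear layer, with ReLU between layers but not after the last."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x


rng = random.Random(0)
layers = init_layers(LAYER_SIZES, rng)
logits = forward(layers, [rng.gauss(0.0, 1.0) for _ in range(LAYER_SIZES[0])])
print(count_parameters(LAYER_SIZES), len(logits))  # prints "1059139 3"
```

Under these assumptions the network has 1,059,139 trainable parameters, almost all of them in the first layer. The three outputs plausibly correspond to the three pair types named in the Dataset Splits row (positively related, negatively related, unrelated), though the excerpt does not state this mapping explicitly.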