Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning
Authors: Subhabrata Dutta, Joykirat Singh, Soumen Chakrabarti, Tanmoy Chakraborty
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This work investigates the neural sub-structures within LLMs that manifest CoT reasoning from a mechanistic point of view. From an analysis of Llama-2 7B applied to multistep reasoning over fictional ontologies, we demonstrate that LLMs deploy multiple parallel pathways of answer generation for step-by-step reasoning. These parallel pathways provide sequential answers from the input question context as well as the generated CoT. Our findings supply empirical answers to a pertinent open question about whether LLMs actually rely on CoT to answer questions (Tan, 2023; Lampinen et al., 2022). |
| Researcher Affiliation | Academia | Subhabrata Dutta (IIT Delhi, India); Joykirat Singh (Independent); Soumen Chakrabarti (IIT Bombay, India); Tanmoy Chakraborty (IIT Delhi, India) |
| Pseudocode | No | The paper describes methods and procedures in prose, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The source code and data are made available at https://github.com/joykirat18/How-To-Think-Step-by-Step. |
| Open Datasets | Yes | To minimize the effects of MLP blocks and focus primarily on reasoning from the provided context, we make use of the PrOntoQA dataset (Saparov & He, 2023) that employs ontology-based question answering using fictional entities (see Figure 1 for an example). |
| Dataset Splits | Yes | Total training pairs: 28,392; total testing pairs: 9,204. All three types of pairs (positively related, negatively related, and unrelated) are present in equal proportion in the training and testing data. |
| Hardware Specification | No | The paper mentions using 'Llama-2 7B' as the model, but does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using 'Llama-2 7B' but does not specify any other software dependencies or their version numbers, such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We use 6-shot examples of CoT for generation in all the experiments. ... 4-layer MLP model: 4096 * 2 -> 128 -> 64 -> 32 -> 3, with ReLU between each linear layer. Learning rate: 0.00005; number of epochs: 120. |
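The probe architecture quoted in the Experiment Setup row can be sketched as follows. This is a hypothetical reconstruction, not the authors' code: the layer widths (4096 * 2 input, i.e. a pair of concatenated Llama-2 7B hidden states, down to 3 output classes) and the ReLU placement come from the table, while the use of NumPy, the weight initialization, and the `forward` helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer widths as reported: 4096 * 2 -> 128 -> 64 -> 32 -> 3.
sizes = [4096 * 2, 128, 64, 32, 3]

# He-style initialization (an assumption; the paper does not specify one).
weights = [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: ReLU between each linear layer, none after the last."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0.0)
    return x

# A batch of 4 concatenated hidden-state pairs produces 4 x 3 class logits.
logits = forward(rng.standard_normal((4, sizes[0])))
print(logits.shape)  # (4, 3)
```

Training details from the table (learning rate 0.00005, 120 epochs) would apply on top of this forward pass; the optimizer is not specified in the extracted text.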