Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought
Authors: Abulhair Saparov, He He
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We systematically evaluate INSTRUCTGPT (Ouyang et al., 2022) and the original GPT-3 (Brown et al., 2020) on PRONTOQA by controlling a number of variables that characterize the complexity of the reasoning task, such as the ontology type and the number of proof steps required. Our analysis shows that these models are quite good at producing valid individual proof steps, even on fictional and counterfactual ontologies. |
| Researcher Affiliation | Academia | Abulhair Saparov & He He, Center for Data Science, New York University, New York, NY 10011, USA. {as17582,hhe}@nyu.edu |
| Pseudocode | Yes | Pseudocode of the procedure to evaluate proofs is given in Algorithm 1 in the Appendix. |
| Open Source Code | Yes | All analysis code, data, data generation scripts, and model outputs are available at github.com/asaparov/prontoqa. |
| Open Datasets | Yes | To enable easy analysis of the CoT, we construct a new synthetic QA dataset called PRONTOQA, for Proof and Ontology-Generated Question-Answering. Inspired by the PROOFWRITER dataset (Tafjord et al., 2021)... All analysis code, data, data generation scripts, and model outputs are available at github.com/asaparov/prontoqa. (An illustrative sketch of a generated example's structure follows the table.) |
| Dataset Splits | No | The paper describes using 8-shot in-context learning and evaluating on 400 examples ('For each combination of variables, we run the model on 400 examples generated from the testbed'), but it does not specify explicit training, validation, or test splits for the dataset in the traditional sense, as it leverages pre-trained LLMs with few-shot prompting for evaluation. |
| Hardware Specification | No | The paper states that experiments were run using the 'OpenAI API' on specific dates and evaluates models like 'INSTRUCTGPT and the original GPT-3 (OpenAI models text-ada-001, text-babbage-001, text-curie-001, davinci, text-davinci-001, text-davinci-002)', but it does not provide any specific hardware details (e.g., GPU models, CPU types) used for running these API calls or the experiments. |
| Software Dependencies | No | The paper mentions running experiments using the 'OpenAI API' and refers to models like 'INSTRUCTGPT' and 'GPT-3', and states that 'The command python analyze_results.py produces all figures used in this paper', but it does not specify version numbers for Python or any other software libraries or dependencies used in the experiments or analysis. |
| Experiment Setup | Yes | We use 8-shot in-context learning, so each input to the LLM consists of 8 fully-labeled questions followed by a single test question with missing CoT and label. The model's task is to predict the CoT and label for the test question. Note that all examples across all inputs are independently and identically generated from PRONTOQA. There are a number of variables that we control when generating examples in PRONTOQA: (1) the number of hops, (2) the ordering in which the sentences are generated from the ontology, and (3) the type of the ontology. The number of hops directly controls the difficulty of the generated example, and we experiment with 1, 3, and 5 hops. We control the ontology traversal direction: We either traverse the tree top-down (i.e., preorder) or bottom-up (i.e., postorder), generating a sentence for each traversed edge/node. The ordering also affects the difficulty of the generated example: if the sentences are generated bottom-up, they will follow the same order as the steps in the gold proof. On the other hand, if they are generated top-down, the order is reversed, and the task may be more difficult. For each combination of variables, we run the model on 400 examples generated from the testbed, for a total of 48 experiments. (A minimal sketch of this prompting and evaluation setup also follows the table.) |
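
To make the quoted dataset description more concrete, the snippet below sketches one plausible way a single PRONTOQA-style example (context sentences generated from an ontology, a true/false query, a gold chain of thought, and a label) could be represented. The field names and the sample sentences are illustrative assumptions, not the released data format; consult github.com/asaparov/prontoqa for the actual files.

```python
# Illustrative (assumed) structure of a single PRONTOQA-style example.
# Field names and sentences are hypothetical; see github.com/asaparov/prontoqa
# for the real data format.
example = {
    # Context sentences produced by traversing the ontology (here, bottom-up,
    # so they follow the same order as the gold proof steps).
    "context": [
        "Fae is a cat.",
        "Every cat is a feline.",
        "Every feline is a carnivore.",
        "Every carnivore is an animal.",
    ],
    # The true/false query the model must answer.
    "query": "True or false: Fae is an animal.",
    # Gold chain of thought, one sentence group per proof step (3 hops here).
    "chain_of_thought": [
        "Fae is a cat.",
        "Every cat is a feline. So Fae is a feline.",
        "Every feline is a carnivore. So Fae is a carnivore.",
        "Every carnivore is an animal. So Fae is an animal.",
    ],
    # Gold label.
    "answer": "True",
}
```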
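
Building on the assumed example layout above, the following minimal sketch shows how the quoted 8-shot setup might be reproduced: eight fully labeled demonstrations are concatenated with an unlabeled test question, a completion is requested from a model, and accuracy is computed over 400 examples per condition. The `format_example`, `build_prompt`, `query_model`, and `run_condition` helpers and the Q:/A: template are assumptions for illustration; this is not the authors' released code. The paper queried the OpenAI API (e.g. text-davinci-002) and ran one such condition for each combination of hops (1, 3, 5), sentence ordering (top-down, bottom-up), and ontology type, for 48 experiments in total.

```python
import random
from typing import Dict, List


def format_example(ex: Dict, with_answer: bool = True) -> str:
    """Render one example in a generic Q:/A: few-shot layout (assumed template)."""
    question = " ".join(ex["context"]) + " " + ex["query"]
    if not with_answer:
        return f"Q: {question}\nA:"
    cot = " ".join(ex["chain_of_thought"])
    return f"Q: {question}\nA: {cot} {ex['answer']}"


def build_prompt(demonstrations: List[Dict], test_example: Dict) -> str:
    """8-shot prompt: eight fully labeled questions followed by the test
    question, whose chain of thought and label the model must predict."""
    parts = [format_example(d) for d in demonstrations]
    parts.append(format_example(test_example, with_answer=False))
    return "\n\n".join(parts)


def query_model(prompt: str) -> str:
    """Placeholder for a completion call (the paper used the OpenAI API);
    substitute whatever client and model you are reproducing with."""
    raise NotImplementedError


def run_condition(demonstration_pool: List[Dict], test_examples: List[Dict]) -> float:
    """Evaluate one combination of control variables (hops, ordering,
    ontology type) on up to 400 independently generated test examples."""
    test_examples = test_examples[:400]
    correct = 0
    for test in test_examples:
        demos = random.sample(demonstration_pool, 8)  # 8 i.i.d. labeled examples
        completion = query_model(build_prompt(demos, test))
        predicted = completion.strip().split()[-1]  # naive label extraction
        correct += predicted.lower() == test["answer"].lower()
    return correct / len(test_examples)
```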