Complexity-Based Prompting for Multi-step Reasoning
Authors: Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BIG-Bench Hard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements. |
| Researcher Affiliation | Collaboration | University of Edinburgh; Allen Institute for AI |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Work done during internship at Allen Institute for AI; code at https://github.com/FranxYao/Complexity-Based-Prompting |
| Open Datasets | Yes | We use three math word problem datasets (GSM8K, MultiArith, and MathQA) and three non-math reasoning datasets (StrategyQA, Date Understanding, and Penguins) as our testbed. We choose GSM8K and MultiArith also because they are the datasets used by prior work on CoTs (Wei et al., 2022b; Wang et al., 2022b; Kojima et al., 2022), allowing fair comparison to existing methods. |
| Dataset Splits | Yes | For each dataset, we randomly draw 200 instances from the training data to create a validation split. |
| Hardware Specification | No | The paper indicates use of GPT-3 and Codex (OpenAI API), which abstracts hardware details. It does not specify the hardware used by the authors for running their experiments or interacting with these models. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that were used to implement or run the experiments. |
| Experiment Setup | Yes | All prompts for math datasets contain 8 cases (a case = a question + a chain of thoughts + an answer). For multi-step reasoning tasks, we follow the chain-of-thoughts prompting framework and compare all prompting schemes using GPT-3 text-davinci-002 and Codex code-davinci-002. Following Kojima et al. (2022), we add "Let's think step by step" before the reasoning chains for all prompting schemes to improve the performance. In our experiments, we set N to 50, and observe that the optimal K is always smaller than N (typically 30-40). |
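
The Experiment Setup row above describes the prompt format (8 exemplars, each a question, a chain of thoughts, and an answer, with "Let's think step by step" inserted before each reasoning chain) but not a concrete template. Below is a minimal sketch of how such a prompt could be assembled under the assumption, suggested by the paper's title, that complexity is measured by the number of reasoning steps in a chain; the field labels (`Question:`, `The answer is`) are illustrative, not the authors' exact wording.

```python
from typing import List, Tuple

# A case is a (question, chain_of_thought, answer) triple, as in the
# Experiment Setup row above.
Case = Tuple[str, str, str]

def chain_complexity(chain: str) -> int:
    """Proxy for complexity: number of non-empty reasoning lines in the chain."""
    return sum(1 for line in chain.splitlines() if line.strip())

def build_complexity_based_prompt(candidates: List[Case], n_cases: int = 8) -> str:
    """Keep the n_cases most complex exemplars and format them as a few-shot
    prompt, inserting "Let's think step by step" before each reasoning chain."""
    exemplars = sorted(candidates, key=lambda c: chain_complexity(c[1]),
                       reverse=True)[:n_cases]
    blocks = []
    for question, chain, answer in exemplars:
        blocks.append(
            f"Question: {question}\n"
            "Answer: Let's think step by step\n"
            f"{chain}\n"
            f"The answer is {answer}.\n"
        )
    return "\n".join(blocks)
```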
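The same row reports setting N to 50 with an optimal K of roughly 30-40, i.e. the paper's complexity-based consistency at test time: sample N reasoning chains and majority-vote over the answers of the K most complex ones. A hedged sketch of that selection-and-voting step follows, where `sample_chain` is a hypothetical wrapper (not from the paper) around the model's sampling API that returns one chain of thought together with its parsed final answer.

```python
from collections import Counter
from typing import Callable, Tuple

def complexity_based_consistency(
    sample_chain: Callable[[str], Tuple[str, str]],  # hypothetical: returns (chain, answer)
    prompt: str,
    question: str,
    n: int = 50,  # number of sampled reasoning chains (N in the row above)
    k: int = 40,  # number of most complex chains kept for voting (K, typically 30-40)
) -> str:
    """Sample n reasoning chains, keep the k most complex, majority-vote the answer."""
    samples = [sample_chain(prompt + "\nQuestion: " + question) for _ in range(n)]
    # Rank sampled chains by complexity (non-empty step count), most complex first.
    samples.sort(
        key=lambda s: sum(1 for line in s[0].splitlines() if line.strip()),
        reverse=True,
    )
    votes = Counter(answer for _, answer in samples[:k])
    return votes.most_common(1)[0][0]
```

The 200-instance validation splits mentioned in the Dataset Splits row are presumably where a value of K in the 30-40 range would be selected per dataset.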