Complexity-Based Prompting for Multi-step Reasoning

Authors: Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BIG-Bench Hard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements.
Researcher Affiliation | Collaboration | University of Edinburgh; Allen Institute for AI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Work done during internship at Allen Institute for AI, code at https://github.com/FranxYao/Complexity-Based-Prompting
Open Datasets | Yes | We use three math word problem datasets (GSM8K, MultiArith, and MathQA) and three non-math reasoning datasets (StrategyQA, Date Understanding, and Penguins) as our testbed. We choose GSM8K and MultiArith also because they are the datasets used by prior work on CoTs (Wei et al., 2022b; Wang et al., 2022b; Kojima et al., 2022), allowing fair comparison to existing methods.
Dataset Splits | Yes | For each dataset, we randomly draw 200 instances from the training data to create a validation split.
Hardware Specification | No | The paper indicates use of GPT-3 and Codex (OpenAI API), which abstracts hardware details. It does not specify the hardware used by the authors for running their experiments or interacting with these models.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that were used to implement or run the experiments.
Experiment Setup | Yes | All prompts for math datasets contain 8 cases (a case = a question + a chain of thoughts + an answer). For multi-step reasoning tasks, we follow the chain-of-thoughts prompting framework and compare all prompting schemes using GPT-3 text-davinci-002 and Codex code-davinci-002. Following Kojima et al. (2022), we add "Let's think step by step" before the reasoning chains for all prompting schemes to improve the performance. In our experiments, we set N to 50, and observe that the optimal K is always smaller than N (typically 30-40).
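
To make the quoted setup concrete: complexity-based consistency samples N reasoning chains per question (N = 50 above), keeps only the K most complex chains (typically 30-40), and majority-votes over their answers. The sketch below illustrates only that selection-and-voting step, assuming each sample has already been split into a reasoning chain and a parsed answer; the line-count complexity heuristic and the function names are illustrative assumptions, not the authors' exact implementation.

```python
from collections import Counter

def count_steps(chain: str) -> int:
    """Approximate a chain's complexity by its number of non-empty lines (reasoning steps)."""
    return sum(1 for line in chain.splitlines() if line.strip())

def complexity_based_vote(chains_with_answers, k=40):
    """Majority-vote over the answers of the k most complex sampled chains.

    The paper samples N = 50 chains per question and reports that the optimal
    K is always smaller than N, typically 30-40.
    """
    # Rank sampled (chain, answer) pairs from most to least complex.
    ranked = sorted(chains_with_answers, key=lambda ca: count_steps(ca[0]), reverse=True)
    # Vote only among the k most complex chains.
    top_k_answers = [answer for _, answer in ranked[:k]]
    return Counter(top_k_answers).most_common(1)[0][0]

# Toy usage: voting over the 2 most complex of three sampled chains returns "5".
sampled = [
    ("There are 3 apples and 2 oranges.\n3 + 2 = 5.\nThe answer is 5.", "5"),
    ("3 + 2 = 5.\nThe answer is 5.", "5"),
    ("The answer is 6.", "6"),
]
print(complexity_based_vote(sampled, k=2))
```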