Complexity-Based Prompting for Multi-step Reasoning

Authors: Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, Tushar Khot

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When used to prompt GPT-3 and Codex, our approach substantially improves multi-step reasoning accuracy and achieves new state-of-the-art (SOTA) performance on three math benchmarks (GSM8K, MultiArith, and MathQA) and two BIG-Bench Hard tasks (Date Understanding and Penguins), with an average +5.3 and up to +18 accuracy improvements.
Researcher Affiliation | Collaboration | University of Edinburgh; Allen Institute for AI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Work done during internship at Allen Institute for AI, code at https://github.com/FranxYao/Complexity-Based-Prompting
Open Datasets | Yes | We use three math word problem datasets (GSM8K, MultiArith, and MathQA) and three non-math reasoning datasets (StrategyQA, Date Understanding, and Penguins) as our testbed. We choose GSM8K and MultiArith also because they are the datasets used by prior work on CoTs (Wei et al., 2022b; Wang et al., 2022b; Kojima et al., 2022), allowing fair comparison to existing methods.
Dataset Splits | Yes | For each dataset, we randomly draw 200 instances from the training data to create a validation split.
Hardware Specification | No | The paper indicates use of GPT-3 and Codex (OpenAI API), which abstracts hardware details. It does not specify the hardware used by the authors for running their experiments or interacting with these models.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions) that were used to implement or run the experiments.
Experiment Setup | Yes | All prompts for math datasets contain 8 cases (a case = a question + a chain of thoughts + an answer). For multi-step reasoning tasks, we follow the chain-of-thoughts prompting framework and compare all prompting schemes using GPT-3 text-davinci-002 and Codex code-davinci-002. Following Kojima et al. (2022), we add "Let's think step by step" before the reasoning chains for all prompting schemes to improve the performance. In our experiments, we set N to 50, and observe that the optimal K is always smaller than N (typically 30-40).
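
To make the quoted setup concrete: complexity-based consistency samples N reasoning chains per question (N = 50 above), keeps only the K most complex chains (typically 30-40), and majority-votes over their answers. The sketch below illustrates only that selection-and-voting step, assuming each sample has already been split into a reasoning chain and a parsed answer; the line-count complexity heuristic and the function names are illustrative assumptions, not the authors' exact implementation.

```python
from collections import Counter

def count_steps(chain: str) -> int:
    """Approximate a chain's complexity by its number of non-empty lines (reasoning steps)."""
    return sum(1 for line in chain.splitlines() if line.strip())

def complexity_based_vote(chains_with_answers, k=40):
    """Majority-vote over the answers of the k most complex sampled chains.

    The paper samples N = 50 chains per question and reports that the optimal
    K is always smaller than N, typically 30-40.
    """
    # Rank sampled (chain, answer) pairs from most to least complex.
    ranked = sorted(chains_with_answers, key=lambda ca: count_steps(ca[0]), reverse=True)
    # Vote only among the k most complex chains.
    top_k_answers = [answer for _, answer in ranked[:k]]
    return Counter(top_k_answers).most_common(1)[0][0]

# Toy usage: voting over the 2 most complex of three sampled chains returns "5".
sampled = [
    ("There are 3 apples and 2 oranges.\n3 + 2 = 5.\nThe answer is 5.", "5"),
    ("3 + 2 = 5.\nThe answer is 5.", "5"),
    ("The answer is 6.", "6"),
]
print(complexity_based_vote(sampled, k=2))
```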