Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

Authors: Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V Le, Ed H. Chi

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental results on tasks related to symbolic manipulation, compositional generalization, and math reasoning reveal that least-to-most prompting is capable of generalizing to more difficult problems than those seen in the prompts.
Researcher Affiliation | Industry | Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, Ed Chi (Google Research, Brain Team)
Pseudocode | No | The paper describes the stages and steps of least-to-most prompting but does not provide any formally labeled pseudocode or algorithm blocks. Examples of prompts are shown, but they are not presented as pseudocode. (An illustrative sketch of the two-stage procedure follows the table.)
Open Source Code | No | The paper states, "We will release the full dataset upon publication of this paper" in Section 7.3, which refers to data, not the source code for the methodology. No explicit mention of, or link to, open-source code for the described methods is provided.
Open Datasets | Yes | We randomly sample words in Wiktionary to construct testing lists with lengths varying from 4 to 12. For each given length, 500 lists are constructed. ... SCAN (Lake & Baroni, 2018) is probably the most popular benchmark for evaluating compositional generalization. ... In this section, we apply least-to-most prompting to solve math word problems in GSM8K (Cobbe et al., 2021) and DROP (Dua et al., 2019). (A sketch of the test-list construction follows the table.)
Dataset Splits | No | The paper mentions training and testing sets (e.g., for SCAN's length split, where "the action sequences in the training set (about 80% of the full set with over 20,000 examples) are shorter than the action sequences in the testing set"). However, it does not provide explicit details about a separate validation split (e.g., percentages or counts) for reproducibility. (A length-split sketch follows the table.)
Hardware Specification | No | The paper mentions various GPT-3 models (e.g., "GPT-3 code-davinci-002", "text-davinci-002", "code-davinci-001") and "LM-540B", but these are language models accessed as software/APIs. It does not provide any specific hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments.
Software Dependencies | Yes | A notable finding is that when the GPT-3 code-davinci-002 model is used with least-to-most prompting... The accuracies of different prompting methods with different language models are presented in Table 8. ... Using code-davinci-002, least-to-most prompting achieves an accuracy of 99.7% under length split. ... In addition, it may be interesting to note that code-davinci-002 consistently outperforms text-davinci-002, regardless of the prompting method. ... The table also contains the results from running against two additional GPT-3 models: text-davinci-002 and code-davinci-001. ... Here, we report results using the text-davinci-002 model and a language model with 540 billion parameters (LM-540B). (An API-call sketch follows the table.)
Experiment Setup | Yes | We have included prompts for all the tasks in the Appendix. ... The prompt in this stage contains constant examples that demonstrate the decomposition, followed by the specific question to be decomposed. ... We use Python notation to make our prompts in least-to-most prompting and the baselines (standard few-shot prompting and chain-of-thought prompting) concise and meet the input size limit of language models (usually up to 2048 tokens). ... We compare here the effectiveness on compositional generalization of least-to-most prompting vs. chain-of-thought prompting by constructing for each prompting method a simple prompt context that contains a single example that is solvable with just 2 reasoning steps. (Illustrative prompt contexts follow the table.)
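
Since the paper provides no formally labeled pseudocode (see the Pseudocode row), the following minimal sketch, written in Python, illustrates the two-stage least-to-most procedure the paper describes: decompose the question into subquestions, then solve the subquestions sequentially while accumulating the generated question/answer pairs in the prompt. The call_llm helper and the exact prompt formatting are assumptions for illustration, not the authors' implementation.

# Minimal illustrative sketch of two-stage least-to-most prompting.
# `call_llm` is a hypothetical stand-in for a completion-style language model API.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in a language model client here")

def least_to_most(question: str, decompose_context: str, solve_context: str) -> str:
    # Stage 1: problem decomposition. Few-shot exemplars in `decompose_context`
    # show how to reduce a complex question to a list of simpler subquestions.
    decomposition = call_llm(f"{decompose_context}\nQ: {question}\nA:")
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: sequential subproblem solving. Each subquestion is answered with
    # the previously generated question/answer pairs appended to the prompt.
    prompt = f"{solve_context}\n{question}\n"
    answer = ""
    for sub in subquestions:
        answer = call_llm(f"{prompt}Q: {sub}\nA:")
        prompt += f"Q: {sub}\nA: {answer}\n"

    # The answer to the final subquestion is taken as the answer to the original question.
    return answer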
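
The Open Datasets row quotes the construction of the symbolic-manipulation test lists (500 randomly sampled word lists per length, for lengths 4 to 12). The sketch below shows one way such lists could be built; reading from a local words.txt file and the fixed random seed are assumptions, since the paper samples words from Wiktionary.

import random

def build_test_lists(word_file: str = "words.txt", seed: int = 0) -> dict:
    # For each list length from 4 to 12, draw 500 random word lists.
    rng = random.Random(seed)
    with open(word_file) as f:
        words = [w.strip() for w in f if w.strip()]
    return {
        length: [rng.sample(words, length) for _ in range(500)]
        for length in range(4, 13)  # lengths 4 through 12
    }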
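
For the Dataset Splits row, a length split of the kind quoted for SCAN assigns examples with short action sequences to training and longer ones to testing. The sketch below is a hedged illustration only; the (command, actions) tuple format and the threshold argument are placeholders, not SCAN's actual file format or cutoff.

def length_split(examples, threshold):
    # Train on examples whose action sequences are at most `threshold` tokens long;
    # hold out the longer ones for testing.
    train = [(cmd, act) for cmd, act in examples if len(act.split()) <= threshold]
    test = [(cmd, act) for cmd, act in examples if len(act.split()) > threshold]
    return train, test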
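
The GPT-3 and Codex models named in the Software Dependencies row were served through the OpenAI API (the LM-540B model is not covered by this sketch). The sketch below assumes the legacy openai Python client (pre-1.0 Completion endpoint) and the now-deprecated code-davinci engines; the decoding parameters are illustrative defaults, not the paper's reported settings.

import openai  # legacy client (< 1.0), which exposed the Completion endpoint

def complete(prompt: str, model: str = "code-davinci-002") -> str:
    # Query a completion-style model; swap in "text-davinci-002" or
    # "code-davinci-001" for the other models named in the quote.
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=256,
        temperature=0.0,  # greedy decoding, a common choice for reasoning evaluations
    )
    return response["choices"][0]["text"]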
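
To make the single-exemplar comparison in the Experiment Setup row concrete, the constants below sketch what a one-example chain-of-thought context and a one-example least-to-most context (decomposition plus sequential solving) could look like. They use the paper's last-letter-concatenation task for brevity rather than SCAN, and the wording is a paraphrase for illustration, not the paper's verbatim prompts.

# Illustrative one-exemplar prompt contexts (paraphrased; not the paper's verbatim prompts).

CHAIN_OF_THOUGHT_CONTEXT = '''\
Q: "think, machine"
A: The last letter of "think" is "k". The last letter of "machine" is "e".
Concatenating "k" and "e" gives "ke". The answer is "ke".
'''

LEAST_TO_MOST_DECOMPOSITION_CONTEXT = '''\
Q: "think, machine"
A: To solve "think, machine", first solve "think", then solve "think, machine".
'''

LEAST_TO_MOST_SOLVING_CONTEXT = '''\
"think, machine"
Q: "think"
A: The last letter of "think" is "k". The answer is "k".
Q: "think, machine"
A: "k" plus the last letter of "machine", "e", gives "ke". The answer is "ke".
'''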