Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models

Authors: Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V Le, Denny Zhou

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments of STEP-BACK PROMPTING with PaLM-2L, GPT-4 and Llama2-70B models, and observe substantial performance gains on various challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, STEP-BACK PROMPTING improves PaLM-2L performance on MMLU (Physics and Chemistry) by 7% and 11% respectively, TimeQA by 27%, and MuSiQue by 7%.
Researcher Affiliation | Industry | Google DeepMind
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code, nor a link to a code repository for the described methodology.
Open Datasets | Yes | MMLU (Hendrycks et al., 2020) contains a series of benchmarks across diverse domains to evaluate the model's language understanding. We consider the high school physics and chemistry portions of MMLU because of the deep reasoning involved.
Dataset Splits | Yes | We evaluate the models on the test set of TimeQA. As shown in Table 2...Table 5: Stats of the evaluation datasets used in this paper. Domain Dataset Split Number of Examples STEM MMLU high-school Physics Test 151...
Hardware Specification | No | The paper mentions using PaLM-2L, GPT-4, and Llama2-70B models, but does not provide any details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used to run the experiments.
Software Dependencies | No | The paper does not specify version numbers for any software dependencies used in the experiments, such as programming languages, frameworks, or specialized packages.
Experiment Setup | Yes | We use few-shot exemplar demonstrations to execute STEP-BACK PROMPTING on LLMs. We use PaLM-2L as the scoring model for evaluation. We experiment with different sampling temperatures, and find that T = 1 gives us a highly-accurate evaluation.
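The two-stage flow the Experiment Setup row alludes to (first elicit a high-level "step-back" abstraction, then answer the original question grounded on it) can be sketched as below. This is a minimal illustration, not the paper's method: the prompt wording and the `stub_model` stand-in are assumptions, and a real run would substitute an actual LLM call plus the paper's few-shot exemplar demonstrations.

```python
def build_stepback_prompt(question: str) -> str:
    """Stage 1: ask for the general principle behind the original question."""
    return (
        "You are an expert at physics.\n"
        "What is the general physics principle behind this question?\n"
        f"Question: {question}"
    )

def build_reasoning_prompt(question: str, principle: str) -> str:
    """Stage 2: answer the original question, grounded on the abstraction."""
    return (
        f"Principle: {principle}\n"
        f"Question: {question}\n"
        "Answer step by step, applying the principle above."
    )

def step_back_answer(question: str, model) -> str:
    """Run both stages; `model` is any callable mapping a prompt to text."""
    principle = model(build_stepback_prompt(question))
    return model(build_reasoning_prompt(question, principle))

# Hypothetical stand-in for an LLM, so the sketch runs end to end.
def stub_model(prompt: str) -> str:
    if "general physics principle" in prompt:
        return "Ideal gas law: PV = nRT"
    return "Using PV = nRT, pressure increases by a factor of 16."

print(step_back_answer(
    "What happens to P if T increases 2x and V decreases 8x?", stub_model))
```

In practice each stage would carry its own few-shot exemplars, and the same two-call structure applies unchanged to Knowledge QA or Multi-Hop Reasoning by swapping the abstraction prompt's domain wording.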