Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
Authors: Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V Le, Denny Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments of STEP-BACK PROMPTING with PaLM-2L, GPT-4 and Llama2-70B models, and observe substantial performance gains on various challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, STEP-BACK PROMPTING improves PaLM-2L performance on MMLU (Physics and Chemistry) by 7% and 11% respectively, TimeQA by 27%, and MuSiQue by 7%. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | MMLU (Hendrycks et al., 2020) contains a series of benchmarks across diverse domains to evaluate the model's language understanding. We consider the high school physics and chemistry portions of MMLU because of the deep reasoning involved. |
| Dataset Splits | Yes | We evaluate the models on the test set of TimeQA. As shown in Table 2...Table 5: Stats of the evaluation datasets used in this paper. Domain Dataset Split Number of Examples STEM MMLU high-school Physics Test 151... |
| Hardware Specification | No | The paper mentions using PaLM-2L, GPT-4, and Llama2-70B models, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments, such as programming languages, frameworks, or specialized packages. |
| Experiment Setup | Yes | We use few-shot exemplar demonstrations to execute STEP-BACK PROMPTING on LLMs. We use PaLM-2L as the scoring model for evaluation. We experiment with different sampling temperatures, and find that T = 1 gives us a highly-accurate evaluation. (A hedged prompt-construction sketch follows this table.) |
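
The experiment-setup row describes step-back prompting as a two-stage, few-shot prompted procedure: first elicit a more abstract "step-back" question, then answer the original question grounded on the principles behind it. The sketch below illustrates that flow under stated assumptions; the exemplar text, the prompt wording, and the `call_llm` / `step_back_answer` helpers are illustrative placeholders, not the paper's released prompts or code (the paper does not release code).

```python
# Minimal sketch of the two-stage STEP-BACK PROMPTING flow:
#   (1) Abstraction: ask the model a step-back (more generic) question.
#   (2) Reasoning: answer the original question grounded on the abstraction.
# The exemplar and prompt wording below are illustrative assumptions, and
# `call_llm` is a hypothetical placeholder for whichever model endpoint
# (PaLM-2L, GPT-4, or Llama2-70B) is being evaluated.

STEP_BACK_EXEMPLAR = (
    "Original Question: What happens to the pressure of an ideal gas if the "
    "temperature is doubled and the volume is halved?\n"
    "Step-back Question: What are the physics principles behind this question?\n"
)

def build_abstraction_prompt(question: str) -> str:
    """Prompt the model to produce a step-back (more abstract) question."""
    return (
        "You are an expert at Physics. Given a question, take a step back and "
        "ask a more generic question about the underlying principles.\n\n"
        f"{STEP_BACK_EXEMPLAR}\n"
        f"Original Question: {question}\n"
        "Step-back Question:"
    )

def build_reasoning_prompt(question: str, principles: str) -> str:
    """Prompt the model to answer the original question using the principles."""
    return (
        "You are an expert at Physics. Answer the question using the "
        "principles below.\n\n"
        f"Principles:\n{principles}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: substitute the actual model API call used in the evaluation.
    raise NotImplementedError

def step_back_answer(question: str) -> str:
    # Stage 1: abstraction — derive the step-back question and its principles.
    step_back_question = call_llm(build_abstraction_prompt(question))
    principles = call_llm(f"Question: {step_back_question}\nAnswer:")
    # Stage 2: reasoning — answer the original question grounded on principles.
    return call_llm(build_reasoning_prompt(question, principles))
```

In use, `call_llm` would be replaced by the sampling call for the model under test; evaluation scoring with PaLM-2L at temperature T = 1, as quoted above, is a separate step not shown here.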