Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
Authors: Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V Le, Denny Zhou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments of STEP-BACK PROMPTING with PaLM-2L, GPT-4 and Llama2-70B models, and observe substantial performance gains on various challenging reasoning-intensive tasks including STEM, Knowledge QA, and Multi-Hop Reasoning. For instance, STEP-BACK PROMPTING improves PaLM-2L performance on MMLU (Physics and Chemistry) by 7% and 11% respectively, TimeQA by 27%, and MuSiQue by 7%. |
| Researcher Affiliation | Industry | Google DeepMind |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | Yes | MMLU (Hendrycks et al., 2020) contains a series of benchmarks across diverse domains to evaluate the model's language understanding. We consider the high school physics and chemistry portions of MMLU because of the deep reasoning involved. |
| Dataset Splits | Yes | We evaluate the models on the test set of TimeQA. As shown in Table 2...Table 5: Stats of the evaluation datasets used in this paper. Domain Dataset Split Number of Examples STEM MMLU high-school Physics Test 151... |
| Hardware Specification | No | The paper mentions using PaLM-2L, GPT-4, and Llama2-70B models, but does not provide any specific details about the hardware (e.g., GPU models, CPU types, or cloud compute specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments, such as programming languages, frameworks, or specialized packages. |
| Experiment Setup | Yes | We use few-shot exemplar demonstrations to execute STEP-BACK PROMPTING on LLMs. We use PaLM-2L as the scoring model for evaluation. We experiment with different sampling temperatures, and find that T = 1 gives us a highly-accurate evaluation. (A hedged prompt-construction sketch follows this table.) |
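
The experiment-setup row describes step-back prompting as a two-stage, few-shot prompted procedure: first elicit a more abstract "step-back" question, then answer the original question grounded on the principles behind it. The sketch below illustrates that flow under stated assumptions; the exemplar text, the prompt wording, and the `call_llm` / `step_back_answer` helpers are illustrative placeholders, not the paper's released prompts or code (the paper does not release code).

```python
# Minimal sketch of the two-stage STEP-BACK PROMPTING flow:
#   (1) Abstraction: ask the model a step-back (more generic) question.
#   (2) Reasoning: answer the original question grounded on the abstraction.
# The exemplar and prompt wording below are illustrative assumptions, and
# `call_llm` is a hypothetical placeholder for whichever model endpoint
# (PaLM-2L, GPT-4, or Llama2-70B) is being evaluated.

STEP_BACK_EXEMPLAR = (
    "Original Question: What happens to the pressure of an ideal gas if the "
    "temperature is doubled and the volume is halved?\n"
    "Step-back Question: What are the physics principles behind this question?\n"
)

def build_abstraction_prompt(question: str) -> str:
    """Prompt the model to produce a step-back (more abstract) question."""
    return (
        "You are an expert at Physics. Given a question, take a step back and "
        "ask a more generic question about the underlying principles.\n\n"
        f"{STEP_BACK_EXEMPLAR}\n"
        f"Original Question: {question}\n"
        "Step-back Question:"
    )

def build_reasoning_prompt(question: str, principles: str) -> str:
    """Prompt the model to answer the original question using the principles."""
    return (
        "You are an expert at Physics. Answer the question using the "
        "principles below.\n\n"
        f"Principles:\n{principles}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

def call_llm(prompt: str) -> str:
    # Placeholder: substitute the actual model API call used in the evaluation.
    raise NotImplementedError

def step_back_answer(question: str) -> str:
    # Stage 1: abstraction — derive the step-back question and its principles.
    step_back_question = call_llm(build_abstraction_prompt(question))
    principles = call_llm(f"Question: {step_back_question}\nAnswer:")
    # Stage 2: reasoning — answer the original question grounded on principles.
    return call_llm(build_reasoning_prompt(question, principles))
```

In use, `call_llm` would be replaced by the sampling call for the model under test; evaluation scoring with PaLM-2L at temperature T = 1, as quoted above, is a separate step not shown here.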