Premise Order Matters in Reasoning with Large Language Models

Authors: Xinyun Chen, Ryan Andrew Chi, Xuezhi Wang, Denny Zhou

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that... In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.
Researcher Affiliation | Collaboration | 1Google DeepMind, 2Stanford University. Correspondence to: Xinyun Chen <xinyunchen@google.com>, Ryan A. Chi <ryanchi@cs.stanford.edu>.
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions releasing the R-GSM benchmark, which is a dataset, but does not state that the source code for the methodology is available.
Open Datasets | Yes | Besides logical reasoning, we construct R-GSM to further investigate the ordering effect on mathematical reasoning. Specifically, we build R-GSM on top of a subset of the GSM8K benchmark (Cobbe et al., 2021)...
Dataset Splits | No | The paper mentions using GSM8K test problems and generating logical reasoning problems, but does not provide explicit train/validation/test splits for reproducibility.
Hardware Specification | No | The paper does not specify any hardware used for the experiments (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper lists the models used (e.g., GPT-4-turbo, PaLM 2-L) but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | We perform the greedy decoding with the temperature 0, and apply the zero-shot prompting in all experiments unless otherwise specified.
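
The quoted setup (zero-shot prompting with greedy decoding at temperature 0) can be illustrated with a minimal sketch. This is not the authors' released code: the client library (the openai Python package), the model name, and the example problems are assumptions introduced here for illustration only.

# Minimal sketch of zero-shot prompting with greedy decoding (temperature 0),
# matching the experiment setup quoted above. The openai client, model name,
# and example problems are assumptions, not taken from the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_zero_shot(problem: str, model: str = "gpt-4-turbo") -> str:
    """Send a single zero-shot query; temperature 0 gives greedy decoding."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # greedy decoding
        messages=[{"role": "user", "content": problem}],
    )
    return response.choices[0].message.content

# Hypothetical premise-order comparison on one GSM8K-style problem:
original = (
    "A store sold 10 apples in the morning. It sold twice as many apples "
    "in the afternoon. How many apples did it sell in total?"
)
reordered = (
    "A store sold twice as many apples in the afternoon as in the morning. "
    "It sold 10 apples in the morning. How many apples did it sell in total?"
)
for prompt in (original, reordered):
    print(ask_zero_shot(prompt))

The two prompts differ only in the order of the premises, mirroring in spirit the R-GSM construction noted under Open Datasets; comparing the two answers is a small-scale version of the ordering comparison the paper reports.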