reproducibilityindex.ai

Premise Order Matters in Reasoning with Large Language Models

Authors: Xinyun Chen, Ryan Andrew Chi, Xuezhi Wang, Denny Zhou

ICML 2024

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that... In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark.
Researcher Affiliation	Collaboration	(1) Google DeepMind, (2) Stanford University. Correspondence to: Xinyun Chen <xinyunchen@google.com>, Ryan A. Chi <ryanchi@cs.stanford.edu>.
Pseudocode	No	The paper does not contain any pseudocode or algorithm blocks.
Open Source Code	No	The paper mentions releasing the R-GSM benchmark, which is a dataset, but does not state that the source code for their methodology is open-source or available.
Open Datasets	Yes	Besides logical reasoning, we construct R-GSM to further investigate the ordering effect on mathematical reasoning. Specifically, we build R-GSM on top of a subset of the GSM8K benchmark (Cobbe et al., 2021)...
Dataset Splits	No	The paper mentions using GSM8K test problems and generating logical reasoning problems, but does not provide explicit train/validation/test splits for reproducibility.
Hardware Specification	No	The paper does not specify any hardware used for the experiments (e.g., GPU/CPU models, memory).
Software Dependencies	No	The paper lists models used (e.g., GPT-4-turbo, PaLM 2-L) but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup	Yes	We perform the greedy decoding with the temperature 0, and apply the zero-shot prompting in all experiments unless otherwise specified.
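
Since the table records that the paper provides neither pseudocode nor source code, the following is only a minimal sketch of what a premise-order comparison could look like under the stated setup (zero-shot prompting, temperature 0). It is not the authors' implementation. The function names, the toy problem, and the `decoding_config` key are all hypothetical, and the model call itself is left out.

```python
import random


def reorder_premises(premises, seed):
    """Return a shuffled copy of the premises; the input list is left unchanged."""
    rng = random.Random(seed)  # seeded so the permutation is repeatable
    shuffled = list(premises)
    rng.shuffle(shuffled)
    return shuffled


def build_prompt(premises, question):
    """Zero-shot prompt: premises in the given order, followed by the question."""
    return "\n".join(premises) + "\n" + question


# Hypothetical toy problem, not taken from the paper or from R-GSM.
premises = [
    "If A then B.",
    "If B then C.",
    "A is true.",
]
question = "Is C true?"

original_prompt = build_prompt(premises, question)
permuted_prompt = build_prompt(reorder_premises(premises, seed=0), question)

# Greedy decoding as described in the Experiment Setup row.
# The key name is illustrative; real APIs may spell it differently.
decoding_config = {"temperature": 0.0}
```

Both prompts contain the same premises and the same question, so any difference in model accuracy between them would be attributable to ordering alone. Fixing the seed keeps the permutation repeatable across runs.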