Premise Order Matters in Reasoning with Large Language Models
Authors: Xinyun Chen, Ryan Andrew Chi, Xuezhi Wang, Denny Zhou
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first examine the effect of premise ordering on deductive reasoning on a variety of LLMs, and our evaluation shows that...In addition, we release the benchmark R-GSM, based on GSM8K, to examine the ordering effect for mathematical problem-solving, and we again observe a significant drop in accuracy, relative to the original GSM8K benchmark. |
| Researcher Affiliation | Collaboration | 1Google DeepMind, 2Stanford University. Correspondence to: Xinyun Chen <xinyunchen@google.com>, Ryan A. Chi <ryanchi@cs.stanford.edu>. |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions releasing the R-GSM benchmark, which is a dataset, but does not state that the source code for their methodology is open-source or available. |
| Open Datasets | Yes | Besides logical reasoning, we construct R-GSM to further investigate the ordering effect on mathematical reasoning. Specifically, we build R-GSM on top of a subset of the GSM8K benchmark (Cobbe et al., 2021)... |
| Dataset Splits | No | The paper mentions using GSM8K test problems and generating logical reasoning problems, but does not provide explicit train/validation/test splits for reproducibility. |
| Hardware Specification | No | The paper does not specify any hardware used for the experiments (e.g., GPU/CPU models, memory). |
| Software Dependencies | No | The paper lists models used (e.g., GPT-4-turbo, PaLM 2-L) but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | We perform the greedy decoding with the temperature 0, and apply the zero-shot prompting in all experiments unless otherwise specified. |
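Since the paper releases no code, the core manipulation (presenting logically identical premises in different orders under zero-shot prompting) can be sketched as below. This is a minimal illustration, not the authors' implementation; the premises, the prompt template, and the placeholder `query_model` call (with `temperature=0` for greedy decoding, per the reported setup) are assumptions.

```python
import itertools

def build_prompt(premises, conclusion):
    """Format a zero-shot deductive-reasoning prompt (no few-shot examples).

    The template here is illustrative; the paper does not publish its
    exact prompt wording.
    """
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(premises))
    return (
        "Premises:\n"
        f"{numbered}\n"
        f"Question: Does it follow that: {conclusion}\n"
        "Answer:"
    )

# A toy modus-ponens chain; the paper's problems are generated similarly
# but at larger scale.
premises = [
    "If A, then B.",
    "If B, then C.",
    "A is true.",
]
conclusion = "C is true."

# Every ordering of the same premises yields a logically equivalent
# problem; the paper's finding is that LLM accuracy still varies
# with the ordering.
prompts = [
    build_prompt(list(order), conclusion)
    for order in itertools.permutations(premises)
]

for prompt in prompts:
    # Hypothetical API call, shown only to document the decoding setup:
    # response = query_model(prompt, temperature=0)  # greedy decoding
    pass
```

Each of the six prompts would then be sent to the model with temperature 0, and accuracy compared across orderings.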