Escape Sky-high Cost: Early-stopping Self-Consistency for Multi-step Reasoning

Authors: Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, Kan Li

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on three popular categories of reasoning tasks: arithmetic, commonsense, and symbolic reasoning over language models with varying scales. The empirical results show that ESC reduces the average number of samples by a significant margin on six benchmarks, including MATH (-33.8%), GSM8K (-80.1%), Strategy QA (-76.8%), Commonsense QA (-78.5%), Coin Flip (-84.2%), and Last Letters (-67.4%), while attaining comparable performance.
Researcher Affiliation | Collaboration | Yiwei Li¹, Peiwen Yuan¹, Shaoxiong Feng², Boyuan Pan², Xinglin Wang¹, Bin Sun¹, Heda Wang², Kan Li¹ (¹ School of Computer Science, Beijing Institute of Technology; ² Xiaohongshu Inc)
Pseudocode | Yes | Algorithm 1: Early-Stopping Self-Consistency. Algorithm 2: Control Scheme for Early-Stop Self-Consistency. (A minimal sketch of the early-stopping loop appears after the table.)
Open Source Code | Yes | Our code and data have been released on https://github.com/Yiwei98/ESC.
Open Datasets | Yes | We evaluate the proposed ESC on six benchmark datasets from three categories of reasoning tasks: For arithmetic reasoning, we consider MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). ... For commonsense reasoning, Commonsense QA (Talmor et al., 2019) and Strategy QA (Geva et al., 2021) are used. For symbolic reasoning, we use Last Letter Concatenation and Coin Flip from Wei et al. (2022).
Dataset Splits | No | The paper references well-known datasets and implicitly uses their standard configurations, but it does not explicitly provide the train/validation/test splits (e.g., percentages, sample counts, or a citation for the splits themselves) needed for reproduction. It mentions 'evaluating the entire test set' for MATH, but gives no details for training or validation splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory, or type of computing cluster) used for running the experiments. It mentions using large language models such as GPT-4 and Llama-2, but not the hardware these models were run on.
Software Dependencies | No | The paper mentions the language models used (GPT-4, GPT-3.5-Turbo, Llama-2 7B) and some decoding parameters (temperature, top_p), but it does not specify software dependencies with version numbers (e.g., Python version, or deep-learning framework versions such as PyTorch or TensorFlow) required for replication.
Experiment Setup | Yes | The sampling temperature T for MATH is 0.5, while for the other datasets it is 0.7. GPT-4 and GPT-3.5-Turbo sample predictions without truncation. For Llama-2, the threshold for top-p truncation (Holtzman et al., 2020) is 0.9. ... Accordingly, the window size w for MATH is 8 and for the others is 5.
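
The Pseudocode row above refers to Algorithm 1 (Early-Stopping Self-Consistency); the authors' actual implementation is in the repository at https://github.com/Yiwei98/ESC. The snippet below is only a minimal illustrative sketch of the early-stopping idea as described in the abstract: sample reasoning paths in small windows, stop as soon as all answers within a window agree, and otherwise fall back to the usual self-consistency majority vote. The names `esc_vote` and `sample_answer`, and the maximum sampling budget of 40, are assumptions for illustration, not taken from the released code.

```python
from collections import Counter
from typing import Callable, Hashable

def esc_vote(sample_answer: Callable[[], Hashable],
             max_samples: int = 40,
             window_size: int = 5) -> Hashable:
    """Minimal sketch of early-stopping self-consistency (ESC).

    `sample_answer` draws one chain-of-thought sample from the LLM and
    returns its final parsed answer. Answers are drawn in windows of
    `window_size`; if every answer in the latest window agrees, sampling
    stops early. Otherwise sampling continues up to `max_samples` and a
    plain majority vote is taken, as in standard self-consistency.
    """
    answers = []
    while len(answers) < max_samples:
        window = [sample_answer() for _ in range(window_size)]
        answers.extend(window)
        if len(set(window)) == 1:      # all answers in this window agree
            return window[0]           # early stop: skip the remaining budget
    return Counter(answers).most_common(1)[0][0]  # majority-vote fallback
```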
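
As a usage note, the per-dataset settings quoted in the Experiment Setup row (temperature 0.5 and window size 8 for MATH; temperature 0.7 and window size 5 for the other datasets; top-p 0.9 for Llama-2) could be wired into the sketch roughly as follows. `query_llm` and `answer_with_esc` are hypothetical stand-ins for the actual model API and driver code, and the sampling budget of 40 is again a placeholder.

```python
# Per-dataset decoding settings quoted in the Experiment Setup row.
SETTINGS = {
    "MATH":  {"temperature": 0.5, "window_size": 8},
    "GSM8K": {"temperature": 0.7, "window_size": 5},
    # the remaining benchmarks also use temperature 0.7 and window size 5
}

def query_llm(prompt: str, temperature: float, top_p: float = 1.0) -> str:
    """Hypothetical wrapper around the underlying LLM call
    (GPT-4 / GPT-3.5-Turbo, or Llama-2 with top_p=0.9)."""
    raise NotImplementedError

def answer_with_esc(question: str, dataset: str = "GSM8K") -> str:
    # Uses esc_vote from the sketch above; the budget of 40 is illustrative.
    cfg = SETTINGS[dataset]
    sampler = lambda: query_llm(question, temperature=cfg["temperature"])
    return esc_vote(sampler, max_samples=40, window_size=cfg["window_size"])
```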