Self-Evaluation Guided Beam Search for Reasoning
Authors: Yuxi Xie, Kenji Kawaguchi, Yiran Zhao, James Xu Zhao, Min-Yen Kan, Junxian He, Michael Xie
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on the GSM8K, AQuA, and StrategyQA benchmarks, respectively. Experiment results with Llama-2 on arithmetic reasoning demonstrate the efficiency of our method in outperforming the baseline methods with comparable computational budgets. |
| Researcher Affiliation | Academia | National University of Singapore; The Hong Kong University of Science and Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks, nor does it explicitly label any section as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our code is publicly available at https://guideddecoding.github.io/. |
| Open Datasets | Yes | We evaluate the effectiveness of our approach across three types of reasoning tasks: (1) Arithmetic Reasoning on five math word problem benchmarks, including GSM8K (Cobbe et al., 2021) on math word problems, AQuA (Ling et al., 2017) on algebraic word problems, SVAMP (Patel et al., 2021) on structure variations of math word problems, ASDiv (Miao et al., 2020) on diverse math word problems, and TabMWP (Lu et al., 2023) on tabular math word problems; (2) Symbolic Reasoning on BIG-Bench (Srivastava et al., 2022)...; (3) Commonsense Reasoning on three benchmarks, including CommonsenseQA (Talmor et al., 2019)..., StrategyQA (Geva et al., 2021)... |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits (e.g., specific percentages or sample counts). It refers to using 'training data' for creating few-shot exemplars, but no detailed split information is given; a reproduction would have to fall back on each benchmark's published splits (see the dataset-loading sketch after the table). |
| Hardware Specification | No | The paper states, 'The computational work for this article was partially performed on resources of the National Supercomputing Centre (NSCC), Singapore.', but it does not specify exact GPU/CPU models, processor types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions specific LLM backbones like 'Codex (code-davinci-002)', 'Llama-2 (13B)', and 'ChatGPT (gpt-3.5-turbo)', but does not provide specific version numbers for ancillary software dependencies such as programming languages, libraries, or frameworks (e.g., Python version, PyTorch version). |
| Experiment Setup | Yes | We set k = 5, n = 16 with Codex backbone... The maximum number of steps to decode is capped at 16... We set generation temperatures differently for various tasks and baselines. Regarding the generation temperature γ on Codex, for arithmetic and symbolic reasoning with PAL using deterministic beam search (τ = 0.0), we find that γ ∈ [0.4, 0.8] generally works well... we use α = 0.5 for all datasets but different values of τ for each task. Specifically, we choose τ = 0.5 for PAL and τ = 0.2 for CoT... (see the decoding sketch after the table) |
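Since the paper only states that few-shot exemplars are drawn from 'training data', one plausible way to reconstruct a comparable setup is to rely on each benchmark's published splits. Below is a minimal sketch for GSM8K; the Hugging Face hub name `gsm8k`, its `main` config, the fixed seed, and the 8-shot count are all assumptions made for illustration, not details from the paper.

```python
import random
from datasets import load_dataset

# The hub name "gsm8k" and its "main" config are assumptions; the paper
# names the benchmark but not a loader. GSM8K ships canonical train/test splits.
gsm8k = load_dataset("gsm8k", "main")
train_split, test_split = gsm8k["train"], gsm8k["test"]

# Draw few-shot exemplars from the training split, as the paper describes.
random.seed(0)                                   # seed chosen arbitrarily
exemplars = random.sample(list(train_split), 8)  # 8-shot is an assumption
print(exemplars[0]["question"])
```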
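The hyperparameters quoted in the Experiment Setup row are enough to sketch the decoding loop. The following is a minimal, hypothetical Python sketch of self-evaluation guided stochastic beam search, assuming `generate_step` and `self_eval` callables whose interfaces the paper does not spell out; the values of k, n, the step cap, α, and τ are taken from the row above, and the combined step score follows the paper's weighting of LM confidence against self-evaluation.

```python
import math
import random
from dataclasses import dataclass, field

# Hyperparameters quoted in the Experiment Setup row (Codex backbone, PAL).
K = 5            # beam width k
N = 16           # candidate rollouts n per decoding step
MAX_STEPS = 16   # cap on the number of reasoning steps
ALPHA = 0.5      # balance between LM confidence and self-evaluation
TAU = 0.5        # temperature tau for stochastic beam selection

@dataclass
class Beam:
    steps: list = field(default_factory=list)
    log_score: float = 0.0  # accumulated log of the combined step scores
    done: bool = False

def combined_step_score(lm_logprob: float, eval_prob: float) -> float:
    # Combine generation confidence and self-evaluation in log space:
    # log E(s) = alpha * log P_LM(s) + (1 - alpha) * log C_eval(s).
    return ALPHA * lm_logprob + (1.0 - ALPHA) * math.log(max(eval_prob, 1e-9))

def guided_beam_search(question, generate_step, self_eval):
    """generate_step(question, steps) -> (step_text, lm_logprob, is_final);
    self_eval(question, steps, step_text) -> correctness probability in (0, 1]."""
    beams = [Beam()]
    for _ in range(MAX_STEPS):
        candidates = [b for b in beams if b.done]
        for beam in beams:
            if beam.done:
                continue
            for _ in range(max(1, N // len(beams))):
                step, lp, final = generate_step(question, beam.steps)
                score = beam.log_score + combined_step_score(
                    lp, self_eval(question, beam.steps, step))
                candidates.append(Beam(beam.steps + [step], score, final))
        # Stochastic beam selection: sample K candidates with temperature TAU
        # (with replacement here, purely to keep the sketch short).
        best = max(c.log_score for c in candidates)
        weights = [math.exp((c.log_score - best) / TAU) for c in candidates]
        beams = random.choices(candidates, weights=weights, k=K)
        if all(b.done for b in beams):
            break
    return max(beams, key=lambda b: b.log_score)
```

As τ approaches 0 the selection degenerates to deterministic top-k, which matches the quoted setting of deterministic beam search (τ = 0.0) for PAL on arithmetic and symbolic reasoning.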