Large Language Models are Zero-Shot Reasoners
Authors: Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large-scale InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. |
| Researcher Affiliation | Collaboration | Takeshi Kojima, The University of Tokyo (t.kojima@weblab.t.u-tokyo.ac.jp); Shixiang Shane Gu, Google Research, Brain Team; Machel Reid, Google Research; Yutaka Matsuo, The University of Tokyo; Yusuke Iwasawa, The University of Tokyo |
| Pseudocode | No | The paper describes its two-stage prompting process in detail within the text (Section 3.1), but it does not present this process as a formal pseudocode block or algorithm (a hedged sketch of the two stages is given after this table). |
| Open Source Code | No | The paper does not provide a direct URL or an explicit statement in the main text indicating the release of the source code for the described methodology. |
| Open Datasets | Yes | For arithmetic reasoning, we consider the following six datasets: (1) SingleEq [Koncel-Kedziorski et al., 2015], (2) AddSub [Hosseini et al., 2014], (3) MultiArith [Roy and Roth, 2015], (4) AQUA-RAT [Ling et al., 2017], (5) GSM8K [Cobbe et al., 2021], and (6) SVAMP [Patel et al., 2021]. For commonsense reasoning, we use CommonsenseQA [Talmor et al., 2019] and StrategyQA [Geva et al., 2021]. For symbolic reasoning, we use Last Letter Concatenation and Coin Flip [Wei et al., 2022]. For other logical reasoning tasks, we choose two evaluation sets from the BIG-bench effort [Srivastava et al., 2022]: Date Understanding and Tracking Shuffled Objects. |
| Dataset Splits | No | The paper mentions that for few-shot approaches, they 'run each experiment only once with a fixed seed across all methods and datasets, for fair comparisons with the zero-shot methods', but it does not explicitly provide details about specific train/validation/test dataset splits used for reproducibility. |
| Hardware Specification | Yes | Computational resource of AI Bridging Cloud Infrastructure (ABCI) provided by National Institute of Advanced Industrial Science and Technology (AIST) was used for experiments other than PaLM. |
| Software Dependencies | No | The paper mentions various models (e.g., InstructGPT-3, PaLM) and underlying frameworks (e.g., PyTorch and TensorFlow are cited in references for other works), but it does not specify explicit version numbers for the software dependencies used to run their experiments. |
| Experiment Setup | Yes | Unless otherwise stated, we use text-davinci-002 throughout the experiments. |
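
The paper's two-stage prompting (reasoning extraction followed by answer extraction, Section 3.1) is described only in prose, so the following is a minimal illustrative sketch rather than the authors' released code. The `complete` callable is an assumed placeholder for whatever text-completion backend is available (the paper used text-davinci-002 with greedy decoding); the two trigger strings are the ones the paper reports for arithmetic tasks.

```python
from typing import Callable

def zero_shot_cot(
    question: str,
    complete: Callable[[str], str],  # placeholder: wraps any text-completion endpoint
    reasoning_trigger: str = "Let's think step by step.",
    answer_trigger: str = "Therefore, the answer (arabic numerals) is",
) -> str:
    """Two-stage Zero-shot-CoT prompting, sketched from Section 3.1 of the paper."""
    # Stage 1: reasoning extraction -- append the trigger sentence and let the
    # model generate a free-form chain of thought.
    reasoning_prompt = f"Q: {question}\nA: {reasoning_trigger}"
    reasoning = complete(reasoning_prompt)

    # Stage 2: answer extraction -- feed the generated reasoning back in together
    # with an answer-format trigger and read off the final answer span.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{answer_trigger}"
    return complete(answer_prompt).strip()
```

Parsing the returned span (e.g. extracting the first number for the arithmetic benchmarks) is dataset-specific and omitted here.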