Large Language Models are Zero-Shot Reasoners
Authors: Takeshi Kojima, Shixiang (Shane) Gu, Machel Reid, Yutaka Matsuo, Yusuke Iwasawa
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results demonstrate that our Zero-shot-CoT, using the same single prompt template, significantly outperforms zero-shot LLM performances on diverse benchmark reasoning tasks including arithmetic (MultiArith, GSM8K, AQUA-RAT, SVAMP), symbolic reasoning (Last Letter, Coin Flip), and other logical reasoning tasks (Date Understanding, Tracking Shuffled Objects), without any hand-crafted few-shot examples, e.g. increasing the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with large-scale InstructGPT model (text-davinci-002), as well as similar magnitudes of improvements with another off-the-shelf large model, 540B parameter PaLM. |
| Researcher Affiliation | Collaboration | Takeshi Kojima, The University of Tokyo (t.kojima@weblab.t.u-tokyo.ac.jp); Shixiang Shane Gu, Google Research, Brain Team; Machel Reid, Google Research; Yutaka Matsuo, The University of Tokyo; Yusuke Iwasawa, The University of Tokyo |
| Pseudocode | No | The paper describes its two-stage prompting process in detail within the text (Section 3.1), but it does not present this process as a formal pseudocode block or algorithm (a hedged sketch of the two stages is given after this table). |
| Open Source Code | No | The paper does not provide a direct URL or an explicit statement in the main text indicating the release of the source code for the described methodology. |
| Open Datasets | Yes | For arithmetic reasoning, we consider the following six datasets: (1) SingleEq [Koncel-Kedziorski et al., 2015], (2) AddSub [Hosseini et al., 2014], (3) MultiArith [Roy and Roth, 2015], (4) AQUA-RAT [Ling et al., 2017], (5) GSM8K [Cobbe et al., 2021], and (6) SVAMP [Patel et al., 2021]. For commonsense reasoning, we use CommonsenseQA [Talmor et al., 2019] and StrategyQA [Geva et al., 2021]. For symbolic reasoning, we use Last Letter Concatenation and Coin Flip [Wei et al., 2022]. For other logical reasoning tasks, we choose two evaluation sets from the BIG-bench effort [Srivastava et al., 2022]: Date Understanding and Tracking Shuffled Objects. |
| Dataset Splits | No | The paper mentions that for few-shot approaches, they 'run each experiment only once with a fixed seed across all methods and datasets, for fair comparisons with the zero-shot methods', but it does not explicitly provide details about specific train/validation/test dataset splits used for reproducibility. |
| Hardware Specification | Yes | Computational resource of AI Bridging Cloud Infrastructure (ABCI) provided by National Institute of Advanced Industrial Science and Technology (AIST) was used for experiments other than PaLM. |
| Software Dependencies | No | The paper mentions various models (e.g., InstructGPT-3, PaLM) and underlying frameworks (e.g., PyTorch and TensorFlow are cited in references for other works), but it does not specify explicit version numbers for the software dependencies used to run their experiments. |
| Experiment Setup | Yes | Unless otherwise stated, we use text-davinci-002 throughout the experiments. |
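
The paper's two-stage prompting (reasoning extraction followed by answer extraction, Section 3.1) is described only in prose, so the following is a minimal illustrative sketch rather than the authors' released code. The `complete` callable is an assumed placeholder for whatever text-completion backend is available (the paper used text-davinci-002 with greedy decoding); the two trigger strings are the ones the paper reports for arithmetic tasks.

```python
from typing import Callable

def zero_shot_cot(
    question: str,
    complete: Callable[[str], str],  # placeholder: wraps any text-completion endpoint
    reasoning_trigger: str = "Let's think step by step.",
    answer_trigger: str = "Therefore, the answer (arabic numerals) is",
) -> str:
    """Two-stage Zero-shot-CoT prompting, sketched from Section 3.1 of the paper."""
    # Stage 1: reasoning extraction -- append the trigger sentence and let the
    # model generate a free-form chain of thought.
    reasoning_prompt = f"Q: {question}\nA: {reasoning_trigger}"
    reasoning = complete(reasoning_prompt)

    # Stage 2: answer extraction -- feed the generated reasoning back in together
    # with an answer-format trigger and read off the final answer span.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{answer_trigger}"
    return complete(answer_prompt).strip()
```

Parsing the returned span (e.g. extracting the first number for the arithmetic benchmarks) is dataset-specific and omitted here.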