Self-Consistency Improves Chain of Thought Reasoning in Language Models

Authors: Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks"
Researcher Affiliation | Industry | Google Research, Brain Team; xuezhiw@google.com, dennyzhou@google.com
Pseudocode | No | The paper describes the steps of the method (Figure 1) but does not provide formal pseudocode or an algorithm block. A minimal sketch of the procedure appears after this table.
Open Source Code | No | The paper mentions that UL2 is open-sourced and GPT-3 has a public API, which are models used in the research, but it does not provide source code for the self-consistency methodology itself.
Open Datasets | Yes | "We evaluate self-consistency on the following reasoning benchmarks. Arithmetic reasoning. For these tasks, we used the Math Word Problem Repository (Koncel-Kedziorski et al., 2016), including AddSub (Hosseini et al., 2014), MultiArith (Roy & Roth, 2015), and ASDiv (Miao et al., 2020). We also included AQUA-RAT (Ling et al., 2017), a recently published benchmark of grade-school math problems (GSM8K; Cobbe et al., 2021), and a challenge dataset over math word problems (SVAMP; Patel et al., 2021). Commonsense reasoning. For these tasks, we used CommonsenseQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and the AI2 Reasoning Challenge (ARC) (Clark et al., 2018)."
Dataset Splits | Yes | "By default we use the test split for all datasets if the labels are available for evaluation. For CommonsenseQA we use the dev split."
Hardware Specification | Yes | "For UL2 we use TPU v3 (2x2 configuration, 4 chips, 8 cores). For LaMDA-137B we use TPU v3 (8x8 configuration, 64 chips, 128 cores). For PaLM-540B we use TPU v4 (4x4x12 configuration, 192 chips, 384 cores)."
Software Dependencies | No | The paper mentions specific language models and public APIs (e.g., GPT-3 code-davinci-001 and code-davinci-002 via the OpenAI API) but does not provide specific version numbers for any underlying software libraries, frameworks, or programming languages used to implement the method.
Experiment Setup | Yes | "In particular, for UL2-20B and LaMDA-137B we applied temperature sampling with T = 0.5 and truncated at the top-k (k = 40) tokens with the highest probability, for PaLM-540B we applied T = 0.7, k = 40, and for GPT-3 we use T = 0.7 without top-k truncation." See the decoding sketch after this table.
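
Because the paper gives no formal algorithm block, the following is a minimal sketch of the self-consistency procedure as described around Figure 1 of the paper: sample several chain-of-thought reasoning paths for the same prompt, extract the final answer from each path, and return the answer the most paths agree on. The names `sample_reasoning_path`, `extract_answer`, and `n_paths` are illustrative placeholders, not identifiers from the paper or any released code.

```python
from collections import Counter

def self_consistency_decode(prompt, sample_reasoning_path, extract_answer, n_paths=40):
    """Majority-vote decoding over sampled chain-of-thought reasoning paths."""
    answers = []
    for _ in range(n_paths):
        path = sample_reasoning_path(prompt)   # one stochastic chain-of-thought sample
        answers.append(extract_answer(path))   # keep only the final answer it reaches
    # Marginalize out the reasoning paths: return the most frequent final answer.
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```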
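
The sampling settings quoted in the Experiment Setup row map directly onto standard decoding parameters. The sketch below is only a hedged illustration: the paper's LaMDA-137B and PaLM-540B models are not publicly runnable, so it assumes the open-sourced UL2-20B checkpoint (e.g. `google/ul2` on the Hugging Face Hub) loaded through the `transformers` library; the checkpoint id, prompt placeholder, `max_new_tokens`, and number of returned sequences are assumptions, not settings reported in the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Decoding settings quoted above for UL2-20B: temperature T = 0.5, top-k = 40.
# (PaLM-540B: T = 0.7, k = 40; GPT-3: T = 0.7 with no top-k truncation.)
tokenizer = AutoTokenizer.from_pretrained("google/ul2")      # assumed checkpoint id
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")

cot_prompt = "..."  # few-shot chain-of-thought exemplars followed by the test question

inputs = tokenizer(cot_prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # stochastic decoding yields diverse reasoning paths
    temperature=0.5,          # T = 0.5, as reported for UL2-20B
    top_k=40,                 # truncate sampling to the 40 most probable tokens
    num_return_sequences=8,   # number of sampled paths per question (illustrative)
    max_new_tokens=256,       # assumed output budget; not specified in the quote
)
paths = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Each decoded string in `paths` would then be fed to an answer extractor and aggregated by majority vote as in the sketch above.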