Self-Consistency Improves Chain of Thought Reasoning in Language Models

Authors: Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks"
Researcher Affiliation | Industry | Google Research, Brain Team; xuezhiw@google.com, dennyzhou@google.com
Pseudocode | No | The paper describes the steps of the method (Figure 1) but does not provide formal pseudocode or an algorithm block. A minimal sketch of the procedure appears after this table.
Open Source Code | No | The paper mentions that UL2 is open-sourced and GPT-3 has a public API, which are models used in the research, but it does not provide source code for the self-consistency methodology itself.
Open Datasets | Yes | "We evaluate self-consistency on the following reasoning benchmarks. Arithmetic reasoning. For these tasks, we used the Math Word Problem Repository (Koncel-Kedziorski et al., 2016), including AddSub (Hosseini et al., 2014), MultiArith (Roy & Roth, 2015), and ASDiv (Miao et al., 2020). We also included AQUA-RAT (Ling et al., 2017), a recently published benchmark of grade-school math problems (GSM8K; Cobbe et al., 2021), and a challenge dataset over math word problems (SVAMP; Patel et al., 2021). Commonsense reasoning. For these tasks, we used CommonsenseQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and the AI2 Reasoning Challenge (ARC) (Clark et al., 2018)."
Dataset Splits | Yes | "By default we use the test split for all datasets if the labels are available for evaluation. For CommonsenseQA we use the dev split."
Hardware Specification | Yes | "For UL2 we use TPU v3 (2x2 configuration, 4 chips, 8 cores). For LaMDA-137B we use TPU v3 (8x8 configuration, 64 chips, 128 cores). For PaLM-540B we use TPU v4 (4x4x12 configuration, 192 chips, 384 cores)."
Software Dependencies | No | The paper mentions specific language models and public APIs (e.g., GPT-3 code-davinci-001 and code-davinci-002 via the OpenAI API) but does not provide specific version numbers for any underlying software libraries, frameworks, or programming languages used to implement the method.
Experiment Setup | Yes | "In particular, for UL2-20B and LaMDA-137B we applied temperature sampling with T = 0.5 and truncated at the top-k (k = 40) tokens with the highest probability, for PaLM-540B we applied T = 0.7, k = 40, and for GPT-3 we use T = 0.7 without top-k truncation." See the decoding sketch after this table.
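
Because the paper gives no formal algorithm block, the following is a minimal sketch of the self-consistency procedure as described around Figure 1 of the paper: sample several chain-of-thought reasoning paths for the same prompt, extract the final answer from each path, and return the answer the most paths agree on. The names `sample_reasoning_path`, `extract_answer`, and `n_paths` are illustrative placeholders, not identifiers from the paper or any released code.

```python
from collections import Counter

def self_consistency_decode(prompt, sample_reasoning_path, extract_answer, n_paths=40):
    """Majority-vote decoding over sampled chain-of-thought reasoning paths."""
    answers = []
    for _ in range(n_paths):
        path = sample_reasoning_path(prompt)   # one stochastic chain-of-thought sample
        answers.append(extract_answer(path))   # keep only the final answer it reaches
    # Marginalize out the reasoning paths: return the most frequent final answer.
    best_answer, _count = Counter(answers).most_common(1)[0]
    return best_answer
```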
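
The sampling settings quoted in the Experiment Setup row map directly onto standard decoding parameters. The sketch below is only a hedged illustration: the paper's LaMDA-137B and PaLM-540B models are not publicly runnable, so it assumes the open-sourced UL2-20B checkpoint (e.g. `google/ul2` on the Hugging Face Hub) loaded through the `transformers` library; the checkpoint id, prompt placeholder, `max_new_tokens`, and number of returned sequences are assumptions, not settings reported in the paper.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Decoding settings quoted above for UL2-20B: temperature T = 0.5, top-k = 40.
# (PaLM-540B: T = 0.7, k = 40; GPT-3: T = 0.7 with no top-k truncation.)
tokenizer = AutoTokenizer.from_pretrained("google/ul2")      # assumed checkpoint id
model = AutoModelForSeq2SeqLM.from_pretrained("google/ul2")

cot_prompt = "..."  # few-shot chain-of-thought exemplars followed by the test question

inputs = tokenizer(cot_prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # stochastic decoding yields diverse reasoning paths
    temperature=0.5,          # T = 0.5, as reported for UL2-20B
    top_k=40,                 # truncate sampling to the 40 most probable tokens
    num_return_sequences=8,   # number of sampled paths per question (illustrative)
    max_new_tokens=256,       # assumed output budget; not specified in the quote
)
paths = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

Each decoded string in `paths` would then be fed to an answer extractor and aggregated by majority vote as in the sketch above.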