Self-Consistency Improves Chain of Thought Reasoning in Language Models
Authors: Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, Denny Zhou
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive empirical evaluation shows that self-consistency boosts the performance of chain-of-thought prompting with a striking margin on a range of popular arithmetic and commonsense reasoning benchmarks |
| Researcher Affiliation | Industry | Google Research, Brain Team xuezhiw@google.com, dennyzhou@google.com |
| Pseudocode | No | The paper describes the steps of the method (Figure 1) but does not provide formal pseudocode or an algorithm block. |
| Open Source Code | No | The paper mentions that UL2 is open-sourced and GPT-3 has a public API, which are models used in the research, but it does not provide source code for the self-consistency methodology itself. |
| Open Datasets | Yes | We evaluate self-consistency on the following reasoning benchmarks. Arithmetic reasoning. For these tasks, we used the Math Word Problem Repository (Koncel-Kedziorski et al., 2016), including AddSub (Hosseini et al., 2014), MultiArith (Roy & Roth, 2015), and ASDiv (Miao et al., 2020). We also included AQUA-RAT (Ling et al., 2017), a recently published benchmark of grade-school-math problems (GSM8K; Cobbe et al., 2021), and a challenge dataset over math word problems (SVAMP; Patel et al., 2021). Commonsense reasoning. For these tasks, we used CommonsenseQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021), and the AI2 Reasoning Challenge (ARC) (Clark et al., 2018). |
| Dataset Splits | Yes | By default we use the test split for all datasets if the labels are available for evaluation. For CommonsenseQA we use the dev split; |
| Hardware Specification | Yes | For UL2 we use TPU v3 (2x2 configuration, 4 chips, 8 cores). For LaMDA-137B we use TPU v3 (8x8 configuration, 64 chips, 128 cores). For PaLM-540B we use TPU v4 (4x4x12 configuration, 192 chips, 384 cores). |
| Software Dependencies | No | The paper mentions specific language models and public APIs (e.g., GPT-3 code-davinci-001 and code-davinci-002 via OpenAI API) but does not provide specific version numbers for any underlying software libraries, frameworks, or programming languages used to implement their method. |
| Experiment Setup | Yes | In particular, for UL2-20B and LaMDA-137B we applied temperature sampling with T = 0.5 and truncated at the top-k (k = 40) tokens with the highest probability, for PaLM-540B we applied T = 0.7, k = 40, and for GPT-3 we use T = 0.7 without top-k truncation. |
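Since the paper provides no pseudocode or source code, the decoding procedure it describes (sample several chain-of-thought reasoning paths with temperature sampling, then take the most frequent final answer) can be sketched as follows. This is a minimal illustration, not the authors' implementation; `sample_path` is a hypothetical stand-in for one stochastic decode from a language model with a chain-of-thought prompt (e.g. T = 0.7, top-k = 40).

```python
from collections import Counter

def self_consistency(sample_path, n_samples):
    """Marginalize out reasoning paths via majority vote.

    sample_path: zero-argument callable returning (reasoning_text, answer),
        standing in for one sampled chain-of-thought decode.
    n_samples: number of reasoning paths to sample.
    Returns the most frequent final answer across samples.
    """
    answers = [sample_path()[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy demonstration with pre-canned samples standing in for decoded paths:
# three paths agree on 18, one dissents with 17, so the vote returns 18.
samples = iter([("path A", 18), ("path B", 18), ("path C", 17), ("path D", 18)])
print(self_consistency(lambda: next(samples), n_samples=4))  # → 18
```

The vote is over final answers only, so reasoning paths that differ in wording but reach the same answer reinforce each other, which is the core idea of self-consistency decoding.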