Automatic Chain of Thought Prompting in Large Language Models
Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Auto-CoT on ten public benchmark reasoning tasks, including: (i) arithmetic reasoning (MultiArith (Roy & Roth, 2015), GSM8K (Cobbe et al., 2021), AQUA-RAT (Ling et al., 2017), SVAMP (Patel et al., 2021)); (ii) commonsense reasoning (CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021)); (iii) symbolic reasoning (Last Letter Concatenation, Coin Flip) (Wei et al., 2022b). Experimental results show that Auto-CoT performs on par with Manual-CoT without the need for human intervention. |
| Researcher Affiliation | Collaboration | Zhuosheng Zhang (Shanghai Jiao Tong University); Aston Zhang, Mu Li, Alex Smola (Amazon Web Services) |
| Pseudocode | Yes | Algorithm 1 CLUSTER(Q, k) — Require: questions Q, number of demonstrations k; Ensure: sorted question list q^(i) = [q^(i)_1, q^(i)_2, ...] for each cluster i ∈ {1, ..., k}. Encode each question q ∈ Q with Sentence-BERT; cluster all questions into k clusters; for each cluster i, sort its questions in ascending order of distance to the cluster centre; return all q^(i). Algorithm 2 CONSTRUCT(q^(1), ..., q^(k)) — Require: sorted question lists for all k clusters; Ensure: demonstration list d = [d^(1), ..., d^(k)]. Initialize d ← ∅; for each cluster i ∈ {1, ..., k}, iterate over q ∈ q^(i), obtain (rationale r, answer a) via Zero-Shot-CoT(q), and if (q, r) satisfies the selection heuristic, add (q, r, a) to d and move to the next cluster; return d. (A Python sketch of these two procedures appears below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/amazon-research/auto-cot. |
| Open Datasets | Yes | We evaluate Auto-CoT on ten benchmark datasets from three categories of reasoning tasks: (i) arithmetic reasoning (MultiArith (Roy & Roth, 2015), GSM8K (Cobbe et al., 2021), AddSub (Hosseini et al., 2014), AQUA-RAT (Ling et al., 2017), SingleEq (Koncel-Kedziorski et al., 2015), SVAMP (Patel et al., 2021)); (ii) commonsense reasoning (CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021)); (iii) symbolic reasoning (Last Letter Concatenation, Coin Flip) (Wei et al., 2022b). |
| Dataset Splits | No | The paper does not explicitly state training, validation, and test dataset splits (percentages, sample counts, or references to predefined splits) for its own experiments. |
| Hardware Specification | No | The paper mentions using "the text-davinci-002 version of GPT-3" and "the code-davinci-002 version of Codex" via the OpenAI API. These are language models, not specific hardware specifications (e.g., GPU models, CPU types, or memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions using "GPT-3 (text-davinci-002)" and "Codex" via the OpenAI API, and "Sentence-BERT" for question encoding. However, it does not specify version numbers for general software dependencies like Python, PyTorch, or other libraries used in the implementation or for analysis. |
| Experiment Setup | Yes | Following Wei et al. (2022b), the number of demonstrations k is 8 except for AQuA and Letter (4), CSQA (7), and StrategyQA (6)... Greedy decoding is used to generate the output. We set max_tokens = 256 and temperature = 0. |
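
The CLUSTER and CONSTRUCT procedures quoted in the pseudocode row, together with the decoding settings in the last row, can be illustrated with a minimal Python sketch. The sketch below is our own reconstruction, not the authors' released code: `sentence-transformers` and scikit-learn stand in for the paper's Sentence-BERT encoding and k-means clustering, the `all-MiniLM-L6-v2` checkpoint and the selection thresholds are illustrative choices, and the legacy OpenAI completions endpoint is used with the reported max_tokens = 256 and temperature = 0.

```python
# Minimal sketch of Auto-CoT demonstration construction (assumptions noted above).
import numpy as np
import openai
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def cluster_questions(questions, k):
    """Algorithm 1 (CLUSTER): encode questions with Sentence-BERT, k-means cluster,
    then sort each cluster by distance to its centroid (most central first)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder checkpoint
    embeddings = encoder.encode(questions)
    kmeans = KMeans(n_clusters=k, random_state=0).fit(embeddings)
    clusters = [[] for _ in range(k)]
    for question, emb, label in zip(questions, embeddings, kmeans.labels_):
        dist = np.linalg.norm(emb - kmeans.cluster_centers_[label])
        clusters[label].append((dist, question))
    return [[q for _, q in sorted(c)] for c in clusters]


def zero_shot_cot(question):
    """Two-stage Zero-Shot-CoT query with greedy decoding, per the quoted setup
    (max_tokens for the answer-extraction call is an assumption)."""
    reason_prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = openai.Completion.create(
        model="text-davinci-002", prompt=reason_prompt,
        max_tokens=256, temperature=0,
    )["choices"][0]["text"].strip()
    answer_prompt = f"{reason_prompt} {rationale}\nTherefore, the answer is"
    answer = openai.Completion.create(
        model="text-davinci-002", prompt=answer_prompt,
        max_tokens=32, temperature=0,
    )["choices"][0]["text"].strip()
    return rationale, answer


def construct_demonstrations(sorted_clusters, max_question_words=60, max_steps=5):
    """Algorithm 2 (CONSTRUCT): for each cluster, keep the first question whose
    generated rationale passes a simple length heuristic (thresholds illustrative)."""
    demos = []
    for cluster in sorted_clusters:
        for question in cluster:
            rationale, answer = zero_shot_cot(question)
            n_steps = rationale.count(".")  # crude proxy for the number of reasoning steps
            if len(question.split()) <= max_question_words and n_steps <= max_steps:
                demos.append((question, rationale, answer))
                break
    return demos
```

The returned (question, rationale, answer) triples would then be concatenated, in the paper's few-shot format, in front of each test question.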