Automatic Chain of Thought Prompting in Large Language Models
Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Alex Smola
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Auto-CoT on ten public benchmark reasoning tasks, including: (i) arithmetic reasoning (MultiArith (Roy & Roth, 2015), GSM8K (Cobbe et al., 2021), AQUA-RAT (Ling et al., 2017), SVAMP (Patel et al., 2021)); (ii) commonsense reasoning (CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021)); (iii) symbolic reasoning (Last Letter Concatenation, Coin Flip) (Wei et al., 2022b). Experimental results show that Auto-CoT performs on par with Manual-CoT without the need for human intervention. |
| Researcher Affiliation | Collaboration | Zhuosheng Zhang (Shanghai Jiao Tong University); Aston Zhang, Mu Li, Alex Smola (Amazon Web Services) |
| Pseudocode | Yes | Algorithm 1 CLUSTER(Q, k) — Require: questions Q, number of demonstrations k; Ensure: sorted question list q^(i) = [q^(i)_1, q^(i)_2, ...] for each cluster i ∈ {1, ..., k}. Encode each question q ∈ Q with Sentence-BERT; cluster all questions into k clusters; for each cluster i, sort its questions in ascending order of distance to the cluster centre; return all q^(i). Algorithm 2 CONSTRUCT(q^(1), ..., q^(k)) — Require: sorted question lists for all k clusters; Ensure: demonstration list d = [d^(1), ..., d^(k)]. Initialize d ← ∅; for each cluster i ∈ {1, ..., k}, iterate over q ∈ q^(i), obtain (rationale r, answer a) via Zero-Shot-CoT(q), and if (q, r) satisfies the selection heuristic, add (q, r, a) to d and move to the next cluster; return d. (A Python sketch of these two procedures appears below the table.) |
| Open Source Code | Yes | Code is available at https://github.com/amazon-research/auto-cot. |
| Open Datasets | Yes | We evaluate Auto-CoT on ten benchmark datasets from three categories of reasoning tasks: (i) arithmetic reasoning (MultiArith (Roy & Roth, 2015), GSM8K (Cobbe et al., 2021), AddSub (Hosseini et al., 2014), AQUA-RAT (Ling et al., 2017), SingleEq (Koncel-Kedziorski et al., 2015), SVAMP (Patel et al., 2021)); (ii) commonsense reasoning (CSQA (Talmor et al., 2019), StrategyQA (Geva et al., 2021)); (iii) symbolic reasoning (Last Letter Concatenation, Coin Flip) (Wei et al., 2022b). |
| Dataset Splits | No | The paper does not explicitly state training, validation, and test dataset splits (percentages, sample counts, or references to predefined splits) for its own experiments. |
| Hardware Specification | No | The paper mentions using "the text-davinci-002 version of GPT-3" and "the code-davinci-002 version of Codex" via the OpenAI API. These are language models, not specific hardware specifications (e.g., GPU models, CPU types, or memory) on which the experiments were run. |
| Software Dependencies | No | The paper mentions using "GPT-3 (text-davinci-002)" and "Codex" via the OpenAI API, and "Sentence-BERT" for question encoding. However, it does not specify version numbers for general software dependencies like Python, PyTorch, or other libraries used in the implementation or for analysis. |
| Experiment Setup | Yes | Following Wei et al. (2022b), the number of demonstrations k is 8 except for AQuA and Letter (4), CSQA (7), and StrategyQA (6)... Greedy decoding is used to generate the output. We set max_tokens = 256 and temperature = 0. |
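
The CLUSTER and CONSTRUCT procedures quoted in the pseudocode row, together with the decoding settings in the last row, can be illustrated with a minimal Python sketch. The sketch below is our own reconstruction, not the authors' released code: `sentence-transformers` and scikit-learn stand in for the paper's Sentence-BERT encoding and k-means clustering, the `all-MiniLM-L6-v2` checkpoint and the selection thresholds are illustrative choices, and the legacy OpenAI completions endpoint is used with the reported max_tokens = 256 and temperature = 0.

```python
# Minimal sketch of Auto-CoT demonstration construction (assumptions noted above).
import numpy as np
import openai
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans


def cluster_questions(questions, k):
    """Algorithm 1 (CLUSTER): encode questions with Sentence-BERT, k-means cluster,
    then sort each cluster by distance to its centroid (most central first)."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative encoder checkpoint
    embeddings = encoder.encode(questions)
    kmeans = KMeans(n_clusters=k, random_state=0).fit(embeddings)
    clusters = [[] for _ in range(k)]
    for question, emb, label in zip(questions, embeddings, kmeans.labels_):
        dist = np.linalg.norm(emb - kmeans.cluster_centers_[label])
        clusters[label].append((dist, question))
    return [[q for _, q in sorted(c)] for c in clusters]


def zero_shot_cot(question):
    """Two-stage Zero-Shot-CoT query with greedy decoding, per the quoted setup
    (max_tokens for the answer-extraction call is an assumption)."""
    reason_prompt = f"Q: {question}\nA: Let's think step by step."
    rationale = openai.Completion.create(
        model="text-davinci-002", prompt=reason_prompt,
        max_tokens=256, temperature=0,
    )["choices"][0]["text"].strip()
    answer_prompt = f"{reason_prompt} {rationale}\nTherefore, the answer is"
    answer = openai.Completion.create(
        model="text-davinci-002", prompt=answer_prompt,
        max_tokens=32, temperature=0,
    )["choices"][0]["text"].strip()
    return rationale, answer


def construct_demonstrations(sorted_clusters, max_question_words=60, max_steps=5):
    """Algorithm 2 (CONSTRUCT): for each cluster, keep the first question whose
    generated rationale passes a simple length heuristic (thresholds illustrative)."""
    demos = []
    for cluster in sorted_clusters:
        for question in cluster:
            rationale, answer = zero_shot_cot(question)
            n_steps = rationale.count(".")  # crude proxy for the number of reasoning steps
            if len(question.split()) <= max_question_words and n_steps <= max_steps:
                demos.append((question, rationale, answer))
                break
    return demos
```

The returned (question, rationale, answer) triples would then be concatenated, in the paper's few-shot format, in front of each test question.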