Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

STaR: Bootstrapping Reasoning With Reasoning

Authors: Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental For our experiments, we focus on arithmetic, commonsense reasoning, and grade school math to demonstrate STaR's breadth. In particular, for arithmetic, we follow a setup inspired by [5]. For commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Researcher Affiliation Collaboration Eric Zelikman^1, Yuhuai Wu^{1,2}, Jesse Mu^1, Noah D. Goodman^1 (^1 Department of Computer Science, Stanford University; ^2 Google Research)
Pseudocode Yes Algorithm 1 STaR
Input: M, a pretrained LLM; dataset D = {(x_i, y_i)}_{i=1}^{D} (w/ few-shot prompts)
1: M_0 ← M  # Copy the original model
2: for n in 1...N do  # Outer loop
3:   (r̂_i, ŷ_i) ← M_{n-1}(x_i) ∀i ∈ [1, D]  # Perform rationale generation
4:   (r̂_i^rat, ŷ_i^rat) ← M_{n-1}(add_hint(x_i, y_i)) ∀i ∈ [1, D]  # Perform rationalization
5:   D_n ← {(x_i, r̂_i, y_i) | i ∈ [1, D] ∧ ŷ_i = y_i}  # Filter rationales using ground truth answers
6:   D_n^rat ← {(x_i, r̂_i^rat, y_i) | i ∈ [1, D] ∧ ŷ_i ≠ y_i ∧ ŷ_i^rat = y_i}  # Filter rationalized rationales
7:   M_n ← train(M, D_n ∪ D_n^rat)  # Finetune the original model on correct solutions (inner loop)
8: end for
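Algorithm 1 above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `generate`, `add_hint`, and `finetune` are hypothetical stand-ins for the actual LLM sampling, hint-insertion, and training routines. One assumed simplification: the algorithm rationalizes all examples and then filters to those the model got wrong (line 6), whereas this sketch rationalizes only the failures, which yields the same filtered set.

```python
def star(model, dataset, generate, add_hint, finetune, n_iterations):
    """Sketch of the STaR outer loop.

    dataset: list of (x, y) pairs.
    generate(model, prompt) -> (rationale, answer)   # hypothetical
    add_hint(x, y) -> prompt with the answer hinted  # hypothetical
    finetune(base_model, examples) -> new model      # hypothetical
    """
    base_model = model  # Each iteration finetunes from the ORIGINAL model (line 7)
    for n in range(n_iterations):
        rationales, rationalized = [], []
        for x, y in dataset:
            r_hat, y_hat = generate(model, x)          # rationale generation (line 3)
            if y_hat == y:
                rationales.append((x, r_hat, y))       # keep correct rationales (line 5)
            else:
                # rationalization: retry with the answer as a hint (lines 4, 6)
                r_rat, y_rat = generate(model, add_hint(x, y))
                if y_rat == y:
                    rationalized.append((x, r_rat, y))
        model = finetune(base_model, rationales + rationalized)  # line 7
    return model
```

Restarting from the original model each outer loop (rather than continuing from M_{n-1}) is the detail the paper flags as important to avoid compounding drift.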
Open Source Code Yes We release our code at https://github.com/ezelikman/STaR.
Open Datasets Yes For commonsense question-answering we follow [13, 6] and use Commonsense QA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Dataset Splits Yes The dataset has 12,247 questions, each with five choices, with 9,741 in the train set, 1,221 in the dev set, and 1,285 in the (withheld) test set.
Hardware Specification Yes We thank Google TPU Research Cloud for TPU access. This work was partially supported by SAIL...
Software Dependencies No The paper mentions using GPT-J and a fine-tuning script, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We perform a 100-step learning rate warmup, from which point we use a constant learning rate. Unless stated otherwise, we start with 40 training steps at the first outer loop, and increase the number of inner-loop fine-tuning training steps by 20% with each outer loop.
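The schedule described above can be made concrete with a short sketch. This is an assumed interpretation, not the authors' code: a linear warmup over 100 steps to a constant learning rate, and inner-loop step counts that start at 40 and grow by 20% per outer loop.

```python
def learning_rate(step, peak_lr, warmup_steps=100):
    """Linear warmup for `warmup_steps`, then constant at `peak_lr` (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def finetune_steps(outer_loop, base_steps=40, growth=1.2):
    """Inner-loop training steps for a 0-indexed outer loop: 40, 48, ~58, ..."""
    return round(base_steps * growth ** outer_loop)
```

For example, the first three outer loops would train for 40, 48, and about 58 steps respectively.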