STaR: Bootstrapping Reasoning With Reasoning

Authors: Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For our experiments, we focus on arithmetic, commonsense reasoning, and grade school math to demonstrate STaR's breadth. In particular, for arithmetic, we follow a setup inspired by [5]. For commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Researcher Affiliation | Collaboration | Eric Zelikman 1, Yuhuai Wu 1,2, Jesse Mu 1, Noah D. Goodman 1; 1 Department of Computer Science, Stanford University; 2 Google Research
Pseudocode | Yes | Algorithm 1 (STaR), reproduced below (see the Python sketch after the table):
  Input: M, a pretrained LLM; dataset D = {(x_i, y_i)}_{i=1}^{D} (with few-shot prompts)
  1: M_0 ← M                                                          # Copy the original model
  2: for n in 1...N do                                                # Outer loop
  3:   (r̂_i, ŷ_i) ← M_{n-1}(x_i), ∀i ∈ [1, D]                         # Perform rationale generation
  4:   (r̂^rat_i, ŷ^rat_i) ← M_{n-1}(add_hint(x_i, y_i)), ∀i ∈ [1, D]  # Perform rationalization
  5:   D_n ← {(x_i, r̂_i, y_i) | i ∈ [1, D] ∧ ŷ_i = y_i}               # Filter rationales using ground-truth answers
  6:   D^rat_n ← {(x_i, r̂^rat_i, y_i) | i ∈ [1, D] ∧ ŷ_i ≠ y_i ∧ ŷ^rat_i = y_i}  # Filter rationalized rationales
  7:   M_n ← train(M, D_n ∪ D^rat_n)                                  # Finetune the original model on correct solutions (inner loop)
  8: end for
Open Source Code | Yes | We release our code at https://github.com/ezelikman/STaR.
Open Datasets | Yes | For commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Dataset Splits | Yes | The dataset has 12,247 questions, each with five choices, with 9,741 in the train set, 1,221 in the dev set, and 1,285 in the (withheld) test set.
Hardware Specification | Yes | We thank Google TPU Research Cloud for TPU access. This work was partially supported by SAIL...
Software Dependencies | No | The paper mentions using GPT-J and a fine-tuning script, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We perform a 100-step learning rate warmup, from which point we use a constant learning rate. Unless stated otherwise, we start with 40 training steps at the first outer loop, and increase the number of inner-loop fine-tuning training steps by 20% with each outer loop.
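
A minimal Python sketch of the loop in the Pseudocode row, under the assumption that rationale sampling, hint injection, and fine-tuning are supplied as callables; generate, add_hint, and finetune are hypothetical stand-ins, not the authors' released API:

def star(model, dataset, n_iterations, generate, add_hint, finetune):
    """Sketch of the STaR outer loop (Algorithm 1)."""
    original = model                                    # M_0 <- M: keep the original checkpoint
    current = model
    for n in range(n_iterations):                       # outer loop
        rationales, rationalized = [], []
        for x, y in dataset:
            r_hat, y_hat = generate(current, x)         # rationale generation
            if y_hat == y:
                rationales.append((x, r_hat, y))        # keep rationales that reach the correct answer
            else:
                # rationalization: re-generate with the correct answer given as a hint
                r_rat, y_rat = generate(current, add_hint(x, y))
                if y_rat == y:
                    rationalized.append((x, r_rat, y))
        # fine-tune the original model, not the latest one, on all correct solutions (inner loop)
        current = finetune(original, rationales + rationalized)
    return current

Note the design choice the algorithm encodes: each iteration restarts fine-tuning from the original model M rather than continuing from M_{n-1}, which the paper ties to limiting overfitting across outer-loop iterations.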
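
An illustrative reading of the schedule quoted in the Experiment Setup row, assuming a linear warmup and rounding to the nearest whole step (neither detail is stated in the quote):

def learning_rate(step, base_lr, warmup_steps=100):
    # 100-step warmup (assumed linear here), then a constant learning rate
    return base_lr * min(1.0, step / warmup_steps)

def inner_loop_steps(outer_iteration, base_steps=40, growth=1.2):
    # 40 fine-tuning steps at the first outer loop, then 20% more with each outer loop
    return round(base_steps * growth ** outer_iteration)

# Example: under these assumptions, the first five outer loops train for 40, 48, 58, 69, and 83 steps.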