STaR: Bootstrapping Reasoning With Reasoning
Authors: Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For our experiments, we focus on arithmetic, commonsense reasoning, and grade school math to demonstrate STaR's breadth. In particular, for arithmetic, we follow a setup inspired by [5]. For commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9]. |
| Researcher Affiliation | Collaboration | Eric Zelikman¹, Yuhuai Wu¹,², Jesse Mu¹, Noah D. Goodman¹. ¹Department of Computer Science, Stanford University; ²Google Research |
| Pseudocode | Yes | Algorithm 1 STaR. Input: M, a pretrained LLM; dataset D = {(x_i, y_i)}_{i=1}^D (w/ few-shot prompts). 1: M_0 ← M (copy the original model); 2: for n in 1...N do (outer loop); 3: (r̂_i, ŷ_i) ← M_{n−1}(x_i) ∀i ∈ [1, D] (perform rationale generation); 4: (r̂_i^rat, ŷ_i^rat) ← M_{n−1}(add_hint(x_i, y_i)) ∀i ∈ [1, D] (perform rationalization); 5: D_n ← {(x_i, r̂_i, y_i) | i ∈ [1, D] ∧ ŷ_i = y_i} (filter rationales using ground-truth answers); 6: D_n^rat ← {(x_i, r̂_i^rat, y_i) | i ∈ [1, D] ∧ ŷ_i ≠ y_i ∧ ŷ_i^rat = y_i} (filter rationalized rationales); 7: M_n ← train(M, D_n ∪ D_n^rat) (fine-tune the original model on correct solutions, the inner loop); 8: end for. (A minimal Python sketch of this loop is given after the table.) |
| Open Source Code | Yes | We release our code at https://github.com/ezelikman/STaR. |
| Open Datasets | Yes | For commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9]. |
| Dataset Splits | Yes | The dataset has 12,247 questions, each with five choices, with 9,741 in the train set, 1,221 in the dev set, and 1,285 in the (withheld) test set. |
| Hardware Specification | Yes | We thank Google TPU Research Cloud for TPU access. This work was partially supported by SAIL... |
| Software Dependencies | No | The paper mentions using GPT-J and a fine-tuning script, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We perform a 100-step learning rate warmup, from which point we use a constant learning rate. Unless stated otherwise, we start with 40 training steps at the first outer loop, and increase the number of inner-loop fine-tuning training steps by 20% with each outer loop. (A small sketch of this schedule is given after the table.) |
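
The Algorithm 1 pseudocode quoted above reduces to a short training loop. Below is a minimal, self-contained Python sketch of that loop, assuming a placeholder `Model` class whose `generate` and `finetuned` methods stand in for the paper's GPT-J sampling and fine-tuning pipeline; `add_hint` is likewise a hypothetical helper. Only the control flow (rationale generation, rationalization with the answer as a hint, filtering against ground truth, and re-fine-tuning from the original model) follows the pseudocode.

```python
"""Sketch of STaR's outer loop (Algorithm 1). `Model`, `generate`, `finetuned`,
and `add_hint` are toy placeholders, not the paper's actual implementation."""
from dataclasses import dataclass, field


@dataclass
class Model:
    """Placeholder LLM that simply records what it was fine-tuned on."""
    training_data: list = field(default_factory=list)

    def generate(self, prompt: str) -> tuple[str, str]:
        # Stand-in for few-shot sampling; returns (rationale, predicted answer).
        return f"rationale for {prompt!r}", "A"

    def finetuned(self, examples: list) -> "Model":
        # Stand-in for fine-tuning a fresh copy of the base model on `examples`.
        return Model(training_data=list(examples))


def add_hint(x: str, y: str) -> str:
    # Rationalization prompt: include the ground-truth answer as a hint.
    return f"{x} (hint: the answer is {y})"


def star(base_model: Model, dataset: list[tuple[str, str]], n_loops: int) -> Model:
    model = base_model  # M_0 <- M
    for _ in range(n_loops):  # outer loop
        kept = []
        for x, y in dataset:
            rationale, y_hat = model.generate(x)  # rationale generation
            if y_hat == y:
                # Keep rationales whose answer matches the ground truth.
                kept.append((x, rationale, y))
            else:
                # Rationalization: retry with a hint; keep only if now correct.
                rat_rationale, y_hat_rat = model.generate(add_hint(x, y))
                if y_hat_rat == y:
                    kept.append((x, rat_rationale, y))
        # Fine-tune the ORIGINAL model on all correct solutions (not M_{n-1}).
        model = base_model.finetuned(kept)
    return model


if __name__ == "__main__":
    data = [("2 + 2 = ?", "A"), ("capital of France?", "B")]
    final = star(Model(), data, n_loops=3)
    print(len(final.training_data), "examples in the last fine-tuning set")
```

Note that line 7 of the pseudocode fine-tunes the original model M on D_n ∪ D_n^rat at each iteration rather than continuing from M_{n−1}; the sketch mirrors that choice by always calling `finetuned` on `base_model`.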
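The Experiment Setup row also implies a simple schedule: inner-loop fine-tuning runs for 40 steps in the first outer loop and grows by 20% with each subsequent loop, with a 100-step learning-rate warmup before a constant rate. The sketch below assumes a linear warmup shape, a placeholder base learning rate, and truncation of step counts to integers; none of these details are stated in the quoted text.

```python
def inner_loop_steps(outer_iteration: int, initial_steps: int = 40,
                     growth: float = 1.2) -> int:
    """Fine-tuning steps for outer loop n (n starts at 0): 40, 48, 57, ..."""
    return int(initial_steps * growth ** outer_iteration)  # truncation is an assumption


def learning_rate(step: int, base_lr: float = 1e-6, warmup_steps: int = 100) -> float:
    """100-step warmup, then constant. The linear shape and base_lr value
    are placeholders, not values reported in the quoted text."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


print([inner_loop_steps(n) for n in range(5)])  # [40, 48, 57, 69, 82] with truncation
```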