Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

STaR: Bootstrapping Reasoning With Reasoning

Authors: Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah Goodman

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental For our experiments, we focus on arithmetic, commonsense reasoning, and grade school math to demonstrate STaR's breadth. In particular, for arithmetic, we follow a setup inspired by [5]. For commonsense question-answering we follow [13, 6] and use CommonsenseQA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Researcher Affiliation Collaboration Eric Zelikman^1, Yuhuai Wu^{1,2}, Jesse Mu^1, Noah D. Goodman^1 (^1 Department of Computer Science, Stanford University; ^2 Google Research)
Pseudocode Yes Algorithm 1 STaR
Input: M, a pretrained LLM; dataset D = {(x_i, y_i)}_{i=1}^{D} (w/ few-shot prompts)
1: M_0 ← M  # Copy the original model
2: for n in 1...N do  # Outer loop
3:   (r̂_i, ŷ_i) ← M_{n-1}(x_i) ∀i ∈ [1, D]  # Perform rationale generation
4:   (r̂_i^rat, ŷ_i^rat) ← M_{n-1}(add_hint(x_i, y_i)) ∀i ∈ [1, D]  # Perform rationalization
5:   D_n ← {(x_i, r̂_i, y_i) | i ∈ [1, D] ∧ ŷ_i = y_i}  # Filter rationales using ground truth answers
6:   D_n^rat ← {(x_i, r̂_i^rat, y_i) | i ∈ [1, D] ∧ ŷ_i ≠ y_i ∧ ŷ_i^rat = y_i}  # Filter rationalized rationales
7:   M_n ← train(M, D_n ∪ D_n^rat)  # Finetune the original model on correct solutions (inner loop)
8: end for
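Algorithm 1 above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `generate`, `add_hint`, and `finetune` are hypothetical stand-ins for the actual LLM sampling, hint-insertion, and training routines. One assumed simplification: the algorithm rationalizes all examples and then filters to those the model got wrong (line 6), whereas this sketch rationalizes only the failures, which yields the same filtered set.

```python
def star(model, dataset, generate, add_hint, finetune, n_iterations):
    """Sketch of the STaR outer loop.

    dataset: list of (x, y) pairs.
    generate(model, prompt) -> (rationale, answer)   # hypothetical
    add_hint(x, y) -> prompt with the answer hinted  # hypothetical
    finetune(base_model, examples) -> new model      # hypothetical
    """
    base_model = model  # Each iteration finetunes from the ORIGINAL model (line 7)
    for n in range(n_iterations):
        rationales, rationalized = [], []
        for x, y in dataset:
            r_hat, y_hat = generate(model, x)          # rationale generation (line 3)
            if y_hat == y:
                rationales.append((x, r_hat, y))       # keep correct rationales (line 5)
            else:
                # rationalization: retry with the answer as a hint (lines 4, 6)
                r_rat, y_rat = generate(model, add_hint(x, y))
                if y_rat == y:
                    rationalized.append((x, r_rat, y))
        model = finetune(base_model, rationales + rationalized)  # line 7
    return model
```

Restarting from the original model each outer loop (rather than continuing from M_{n-1}) is the detail the paper flags as important to avoid compounding drift.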
Open Source Code Yes We release our code at https://github.com/ezelikman/STaR.
Open Datasets Yes For commonsense question-answering we follow [13, 6] and use Commonsense QA (CQA), a widely used multiple-choice dataset for this domain [10]. For grade school math, we use GSM8K from [9].
Dataset Splits Yes The dataset has 12,247 questions, each with five choices, with 9,741 in the train set, 1,221 in the dev set, and 1,285 in the (withheld) test set.
Hardware Specification Yes We thank Google TPU Research Cloud for TPU access. This work was partially supported by SAIL...
Software Dependencies No The paper mentions using GPT-J and a fine-tuning script, but does not provide specific version numbers for software dependencies like Python, PyTorch, or CUDA.
Experiment Setup Yes We perform a 100-step learning rate warmup, from which point we use a constant learning rate. Unless stated otherwise, we start with 40 training steps at the first outer loop, and increase the number of inner-loop fine-tuning training steps by 20% with each outer loop.
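The schedule described above can be made concrete with a short sketch. This is an assumed interpretation, not the authors' code: a linear warmup over 100 steps to a constant learning rate, and inner-loop step counts that start at 40 and grow by 20% per outer loop.

```python
def learning_rate(step, peak_lr, warmup_steps=100):
    """Linear warmup for `warmup_steps`, then constant at `peak_lr` (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr

def finetune_steps(outer_loop, base_steps=40, growth=1.2):
    """Inner-loop training steps for a 0-indexed outer loop: 40, 48, ~58, ..."""
    return round(base_steps * growth ** outer_loop)
```

For example, the first three outer loops would train for 40, 48, and about 58 steps respectively.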