Generating Sequences by Learning to Self-Correct
Authors: Sean Welleck, Ximing Lu, Peter West, Faeze Brahman, Tianxiao Shen, Daniel Khashabi, Yejin Choi
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that SELF-CORRECTION improves upon the base generator in three diverse generation tasks (mathematical program synthesis, lexically-constrained generation, and toxicity control), even when the corrector is much smaller than the base generator. (Section 3, Experiments:) We evaluate SELF-CORRECTION on a diversity of tasks: mathematical program synthesis, in which generations are strictly correct or incorrect, and generators typically have low performance; lexically-constrained generation, which allows for partial credit, and generators usually give partially-correct solutions (e.g. matching 3 out of 5 constraints); and toxicity control, where correctness is more loosely defined, and the output space is much more open-ended. |
| Researcher Affiliation | Collaboration | 1Allen Institute for Artificial Intelligence 2Center for Language and Speech Processing, Johns Hopkins University 3Paul G. Allen School of Computer Science & Engineering, University of Washington |
| Pseudocode | Yes | Algorithm 1 (Self-corrective learning). Input: generator p0, corrector pθ, prompts X, value v(·), feedback f(·). Initialize datapool D by sampling from p0 (Initialization: Eq. 2). For each iteration: form value-improving pairs P from D (Pairing: Eq. 3); for step in 1, 2, ..., M, sample a batch of value-improving pairs from P using Eq. 4 and compute the loss and update θ using gradient descent (Learning); for x ∈ X, sample hypotheses y from datapool D, generate corrections y′ ∼ pθ(·\|y, x, f(y)), and add all (x, y′, v(y′), f(y′)) to the datapool D (Exploration: Eq. 5). See the training-loop sketch after the table. |
| Open Source Code | Yes | Code will be available at www.github.com/wellecks/self_correction. |
| Open Datasets | Yes | We evaluate on problems from 5 problem solving datasets: MultiArith (Roy et al., 2015), AddSub (Hosseini et al., 2014), SingleOp (Roy et al., 2015), SVAMP (Patel et al., 2021), and GSM8k (Cobbe et al., 2021). We experiment on COMMONGEN (Lin et al., 2020) and E2E (Novikova et al., 2017). |
| Dataset Splits | Yes | For the MultiArith and Multitask settings, we make train/valid/test splits using 60/20/20% of the respective datasets. Similar to Ni et al. (2022), for the GSM setting we use the official GSM8k test split, and create a validation split using 20% of the training set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the 'Huggingface library' and 'Sentence Transformers' but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Tables 8 and 9 show hyperparameters for CommonGen and E2E. Hyperparameters: predictor GPT-2 Large; steps 6000; batch size 128; optimizer Adam; learning rate 1e-5; decoding algorithm beam search (k=5). We use greedy decoding for the generator and corrector, and k = 1. For inference, we use beam search with the generator, then do up to 3 corrections using beam search, stopping early if all constraints are met (see the inference sketch after the table). |
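
The pseudocode row above (Algorithm 1) can be read as a simple alternating loop over pairing, learning, and exploration. The following is a minimal Python sketch of that loop, not the paper's released implementation: `generator.sample`, `corrector.sample`, `corrector.train_step`, `value`, and `feedback` are hypothetical interfaces, and pairs are sampled uniformly rather than with the similarity-weighted scheme of Eq. 4; see the released code at www.github.com/wellecks/self_correction for the authoritative version.

```python
import random

def self_corrective_learning(generator, corrector, prompts, value, feedback,
                             num_iterations=3, num_steps=100, batch_size=32):
    """Sketch of Algorithm 1: self-corrective learning (hypothetical interfaces)."""
    # Initialization (Eq. 2): seed the datapool with samples from the base generator.
    datapool = {x: [(y, value(x, y), feedback(x, y)) for y in generator.sample(x)]
                for x in prompts}

    for _ in range(num_iterations):
        # Pairing (Eq. 3): form value-improving (hypothesis, correction) pairs.
        pairs = []
        for x, candidates in datapool.items():
            for (y, v_y, f_y) in candidates:
                for (y_prime, v_yp, _) in candidates:
                    if v_yp > v_y:  # the correction must improve the value
                        pairs.append((x, y, f_y, y_prime))

        # Learning: update the corrector p_theta(y' | y, x, f(y)) on sampled pairs.
        for _ in range(num_steps):
            batch = random.sample(pairs, min(batch_size, len(pairs)))
            corrector.train_step(batch)  # one gradient-descent step (hypothetical)

        # Exploration (Eq. 5): correct datapool hypotheses and add the results back.
        for x, candidates in datapool.items():
            new_items = []
            for (y, _, f_y) in candidates:
                y_prime = corrector.sample(x, y, f_y)
                new_items.append((y_prime, value(x, y_prime), feedback(x, y_prime)))
            candidates.extend(new_items)

    return corrector
```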
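
The Experiment Setup row also describes the inference procedure: decode with the generator, then apply up to three corrections, stopping early once all constraints are met. A minimal sketch under assumed interfaces (`generator.beam_search`, `corrector.beam_search`, and `constraints_met` are placeholders, not the paper's API; the beam size of 5 follows the hyperparameter table, while some settings use greedy decoding, i.e. k = 1):

```python
def generate_with_corrections(generator, corrector, x, feedback, constraints_met,
                              max_corrections=3, beam_size=5):
    """Sketch of inference: generate, then correct up to max_corrections times."""
    y = generator.beam_search(x, k=beam_size)
    for _ in range(max_corrections):
        if constraints_met(x, y):
            break  # stop early once the output satisfies all constraints
        # Condition the corrector on the prompt, the current output, and its feedback.
        y = corrector.beam_search(x, y, feedback(x, y), k=beam_size)
    return y
```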