Self-Refine: Iterative Refinement with Self-Feedback
Authors: Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate SELF-REFINE across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5 and GPT-4) LLMs. Across all evaluated tasks, outputs generated with SELF-REFINE are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by 20% absolute on average in task performance. |
| Researcher Affiliation | Collaboration | 1 Language Technologies Institute, Carnegie Mellon University; 2 Allen Institute for Artificial Intelligence; 3 University of Washington; 4 NVIDIA; 5 UC San Diego; 6 Google DeepMind |
| Pseudocode | Yes | Algorithm 1 SELF-REFINE algorithm (a minimal sketch of this loop appears after the table) |
| Open Source Code | Yes | We release all of our code, which is easily extensible to other LLMs. Code and data at https://selfrefine.info/ |
| Open Datasets | Yes | Dialogue Response Generation (Appendix P; Mehri and Eskenazi, 2020), Code Optimization (Appendix Q; Madaan et al., 2023), Code Readability Improvement (Appendix O; Puri et al., 2021), Math Reasoning (Appendix R; Cobbe et al., 2021), Sentiment Reversal (Appendix S; Zhang et al., 2015), and we introduce two new tasks: Acronym Generation (Appendix T) and Constrained Generation (a harder version of Lin et al. (2020) with 20-30 keyword constraints instead of 3-5; Appendix U) |
| Dataset Splits | No | The paper states, "For automatic evaluation in Table 1, we used zero-shot prompting with text-davinci-003 and evaluate on a test set of 342 instances." It mentions a "test set" but does not explicitly provide details about training, validation, or overall dataset splits (e.g., percentages, total counts, or standard split citations). |
| Hardware Specification | No | The paper specifies the use of large language models like "GPT-3.5 (text-davinci-003 and gpt-3.5-turbo)" and "GPT-4 (OpenAI, 2023)", which are accessed via API. It does not provide details on the specific hardware (e.g., GPU models, CPU types, or memory) used to run the experiments. |
| Software Dependencies | No | The paper specifies the use of various large language models such as "GPT-3.5 (text-davinci-003 and gpt-3.5-turbo)", "GPT-4", "CODEX (code-davinci-002)", "Vicuna-13B", and "LLAMA2-70B". However, it does not list specific software dependencies like programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x, TensorFlow 2.x) that would be needed for reproduction beyond the models themselves. |
| Experiment Setup | Yes | We generate samples using a temperature of 0.7. The FEEDBACK-REFINE iterations continue until the desired output quality or task-specific criterion is reached, up to a maximum of 4 iterations. We consider the temperature of both T = 0.0 (greedy) and T = 0.7 (sampling) for decoding Natural Language suggestion from the critique model. We always use a temperature T = 0.0 (greedy) when decoding Programming Language from the code editor. Due to budget constraints, we run SELF-REFINE for N = 5 iterations. |
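
For concreteness, here is a minimal sketch of the FEEDBACK-REFINE loop referenced in the Pseudocode and Experiment Setup rows, using the settings quoted above (sampling temperature 0.7, up to 4 feedback-refine iterations). The `self_refine` function, the `llm` callable, the prompt templates, and the `STOP` sentinel are illustrative assumptions, not the authors' released implementation (available at https://selfrefine.info/); the paper's task-specific feedback prompts and stopping criteria are omitted.

```python
# Sketch of the SELF-REFINE loop (Algorithm 1): generate, self-critique, refine.
# `llm` is a hypothetical (prompt, temperature) -> completion function supplied
# by the caller; it is not part of the paper's codebase.

from typing import Callable

def self_refine(
    task_input: str,
    llm: Callable[[str, float], str],
    max_iterations: int = 4,     # paper: up to 4 FEEDBACK-REFINE iterations
    temperature: float = 0.7,    # paper: sampling temperature of 0.7
) -> str:
    """Iteratively improve an initial output using the model's own feedback."""
    # Step 1: initial generation with the same model that will later refine it.
    output = llm(f"Task:\n{task_input}\n\nAnswer:", temperature)

    for _ in range(max_iterations):
        # Step 2: FEEDBACK -- ask the model to critique its own output.
        feedback = llm(
            f"Task:\n{task_input}\n\nCurrent answer:\n{output}\n\n"
            "Give concrete, actionable feedback. If the answer already meets "
            "the task requirements, reply with exactly: STOP",
            temperature,
        )

        # The stopping criterion is task-specific in the paper; a sentinel
        # token in the feedback is a simple stand-in used here.
        if "STOP" in feedback:
            break

        # Step 3: REFINE -- regenerate conditioned on the input, the previous
        # output, and the feedback.
        output = llm(
            f"Task:\n{task_input}\n\nPrevious answer:\n{output}\n\n"
            f"Feedback:\n{feedback}\n\nImproved answer:",
            temperature,
        )

    return output
```

Note that the paper uses greedy decoding (T = 0.0) when the refiner emits program code; the single temperature parameter above collapses that distinction for brevity.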