Self-Refine: Iterative Refinement with Self-Feedback

Authors: Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, Peter Clark

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SELF-REFINE across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5 and GPT-4) LLMs. Across all evaluated tasks, outputs generated with SELF-REFINE are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by 20% absolute on average in task performance.
Researcher Affiliation | Collaboration | 1 Language Technologies Institute, Carnegie Mellon University; 2 Allen Institute for Artificial Intelligence; 3 University of Washington; 4 NVIDIA; 5 UC San Diego; 6 Google DeepMind
Pseudocode | Yes | Algorithm 1: the SELF-REFINE algorithm (outlined in a sketch after this table)
Open Source Code | Yes | We release all of our code, which is easily extensible to other LLMs. Code and data at https://selfrefine.info/
Open Datasets | Yes | Dialogue Response Generation (Appendix P; Mehri and Eskenazi, 2020), Code Optimization (Appendix Q; Madaan et al., 2023), Code Readability Improvement (Appendix O; Puri et al., 2021), Math Reasoning (Appendix R; Cobbe et al., 2021), Sentiment Reversal (Appendix S; Zhang et al., 2015), and two newly introduced tasks: Acronym Generation (Appendix T) and Constrained Generation (a harder version of Lin et al. (2020) with 20-30 keyword constraints instead of 3-5; Appendix U)
Dataset Splits | No | The paper states, "For automatic evaluation in Table 1, we used zero-shot prompting with text-davinci-003 and evaluate on a test set of 342 instances." It mentions a "test set" but does not explicitly provide details about training, validation, or overall dataset splits (e.g., percentages, total counts, or standard split citations).
Hardware Specification | No | The paper specifies the use of large language models such as "GPT-3.5 (text-davinci-003 and gpt-3.5-turbo)" and "GPT-4 (OpenAI, 2023)", which are accessed via API. It does not provide details on the specific hardware (e.g., GPU models, CPU types, or memory) used to run the experiments.
Software Dependencies | No | The paper specifies the use of various large language models such as "GPT-3.5 (text-davinci-003 and gpt-3.5-turbo)", "GPT-4", "CODEX (code-davinci-002)", "Vicuna-13B", and "LLAMA2-70B". However, it does not list specific software dependencies such as programming language versions (e.g., Python 3.x) or library versions (e.g., PyTorch 1.x, TensorFlow 2.x) that would be needed for reproduction beyond the models themselves.
Experiment Setup | Yes | We generate samples using a temperature of 0.7. The FEEDBACK-REFINE iterations continue until the desired output quality or task-specific criterion is reached, up to a maximum of 4 iterations. We consider the temperature of both T = 0.0 (greedy) and T = 0.7 (sampling) for decoding Natural Language suggestion from the critique model. We always use a temperature T = 0.0 (greedy) when decoding Programming Language from the code editor. Due to budget constraints, we run SELF-REFINE for N = 5 iterations.
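
The Algorithm 1 referenced above is a simple generate-feedback-refine loop. Below is a minimal Python sketch of that loop, assuming a generic llm(prompt, temperature) completion function; the prompt templates, stopping check, and function names are illustrative placeholders, not the authors' released prompts or code.

```python
def self_refine(x, llm, gen_prompt, fb_prompt, refine_prompt, max_iters=4):
    """Minimal sketch of the SELF-REFINE loop (hypothetical helper, not the released code)."""
    # Initial generation from the task input.
    y = llm(gen_prompt.format(input=x), temperature=0.7)
    history = []
    for _ in range(max_iters):
        # The same model critiques its own output.
        fb = llm(fb_prompt.format(input=x, output=y), temperature=0.7)
        # Placeholder stopping criterion; the paper uses task-specific checks.
        if "no further improvement" in fb.lower():
            break
        history.append((y, fb))
        # Refinement conditions on the input plus all previous outputs and feedback.
        context = "\n".join(f"Attempt: {o}\nFeedback: {f}" for o, f in history)
        y = llm(refine_prompt.format(input=x, history=context), temperature=0.7)
    return y
```

Per the quoted setup, the loop halts once the task-specific criterion is met or the iteration budget is exhausted.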
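
The decoding settings quoted in the Experiment Setup row can be summarized in one place. The dictionary below is a hypothetical consolidation for readability; the key names are invented and do not correspond to identifiers in the released repository.

```python
# Hypothetical consolidation of the quoted decoding settings (key names invented).
DECODING_CONFIG = {
    "generation_temperature": 0.7,        # sampling for initial outputs and refinements
    "feedback_temperatures": (0.0, 0.7),  # both greedy and sampled critiques were considered
    "code_editor_temperature": 0.0,       # always greedy when decoding program text
    "max_refine_iterations": 4,           # stop earlier if the task-specific criterion is met
    "budgeted_iterations": 5,             # N = 5 in the budget-constrained runs
}
```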