Large Language Models Cannot Self-Correct Reasoning Yet

Authors: Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, Denny Zhou

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We use datasets where existing self-correction methods with oracle labels have demonstrated significant performance improvement, including GSM8K (Cobbe et al., 2021): GSM8K comprises a test set of 1,319 linguistically diverse grade school math word problems, curated by human problem writers. [...] Tables 3 and 4 report the accuracies and the number of model calls. We observe that, after self-correction, the accuracies of all models drop across all benchmarks." |
| Researcher Affiliation | Collaboration | Jie Huang (1,2), Xinyun Chen (1), Swaroop Mishra (1), Huaixiu Steven Zheng (1), Adams Wei Yu (1), Xinying Song (1), Denny Zhou (1); (1) Google DeepMind, (2) University of Illinois at Urbana-Champaign |
| Pseudocode | No | No section or figure explicitly labeled 'Pseudocode' or 'Algorithm' was found. |
| Open Source Code | No | The paper states that it uses publicly accessible models (GPT-3.5, GPT-4, Llama-2) and provides its prompts in Appendix A, but it does not link to source code for the methodology developed in the paper. |
| Open Datasets | Yes | "We use datasets where existing self-correction methods with oracle labels have demonstrated significant performance improvement, including GSM8K (Cobbe et al., 2021): GSM8K comprises a test set of 1,319 linguistically diverse grade school math word problems, curated by human problem writers. CommonsenseQA (Talmor et al., 2019): This dataset offers a collection of multi-choice questions that test commonsense reasoning. HotpotQA (Yang et al., 2018): HotpotQA is an open-domain multi-hop question answering dataset." All three benchmarks are publicly available (see the loading sketch after the table). |
| Dataset Splits | No | The paper mentions using a 'test set' for GSM8K and a 'dev set' for CommonsenseQA, and sampling instances for testing. It refers to these evaluation sets without specifying a full train/validation/test partition, so the data partitioning is not fully reproducible. |
| Hardware Specification | No | The paper states 'We use GPT-3.5-Turbo (gpt-3.5-turbo-0613) and GPT-4 accessed on 2023/08/29. For intrinsic self-correction, to provide a more thorough analysis, we also evaluate GPT-4-Turbo (gpt-4-1106-preview) and Llama-2 (Llama-2-70b-chat)'. It does not specify the hardware (GPU/CPU models, memory) used to run these models or the experiments. |
| Software Dependencies | No | The paper specifies the LLM versions used (e.g., 'gpt-3.5-turbo-0613', 'gpt-4-1106-preview', 'Llama-2-70b-chat'), which are APIs or specific model weights. However, it does not provide version numbers for ancillary software such as the programming language (e.g., Python) or the frameworks and libraries (e.g., PyTorch) used to run the experiments. |
| Experiment Setup | Yes | "We prompt the models to undergo a maximum of two rounds of self-correction. We use a temperature of 1 for GPT-3.5-Turbo and GPT-4, and a temperature of 0 for GPT-4-Turbo and Llama-2, to provide evaluation across different decoding algorithms. Following Kim et al. (2023); Shinn et al. (2023), we apply a three-step prompting strategy for self-correction: 1) prompt the model to perform an initial generation (which also serves as the results for Standard Prompting); 2) prompt the model to review its previous generation and produce feedback; 3) prompt the model to answer the original question again with the feedback." A sketch of this three-step loop appears after the table. |
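
All three benchmarks are publicly available on the Hugging Face Hub. Below is a minimal loading sketch, assuming the `datasets` library; the Hub dataset names, configs, and splits are our mapping of the paper's description (GSM8K test set, CommonsenseQA dev set), not code from the paper, and the authors' own sampling of instances for testing is not reproduced here.

```python
# Minimal sketch: loading the three benchmarks from the Hugging Face Hub.
# Assumes `pip install datasets`; config/split names follow the Hub versions
# of these datasets, not necessarily the authors' exact preprocessing.
from datasets import load_dataset

gsm8k = load_dataset("gsm8k", "main", split="test")        # 1,319 word problems
csqa = load_dataset("commonsense_qa", split="validation")   # the 'dev set' the paper refers to
hotpot = load_dataset("hotpot_qa", "distractor", split="validation")

print(len(gsm8k), len(csqa), len(hotpot))
```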
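
For concreteness, here is a minimal sketch of the three-step self-correction strategy quoted in the Experiment Setup row, assuming the OpenAI Python client (v1+). The helper names (`chat`, `self_correct`) and the review/re-answer prompt wording are illustrative placeholders; the authors' actual prompts are given in Appendix A of the paper.

```python
# Sketch of the paper's three-step self-correction prompting strategy.
# Assumes `pip install openai` (v1+) and OPENAI_API_KEY in the environment.
# Prompt wording below is illustrative, not the authors' exact prompts.
from openai import OpenAI

client = OpenAI()

def chat(messages, model="gpt-3.5-turbo", temperature=1.0):
    # One model call; the paper uses temperature 1 for GPT-3.5-Turbo/GPT-4
    # and temperature 0 for GPT-4-Turbo/Llama-2.
    resp = client.chat.completions.create(
        model=model, temperature=temperature, messages=messages
    )
    return resp.choices[0].message.content

def self_correct(question, rounds=2):
    # Step 1: initial generation (also the Standard Prompting baseline).
    messages = [{"role": "user", "content": question}]
    answer = chat(messages)
    messages.append({"role": "assistant", "content": answer})
    for _ in range(rounds):  # the paper caps self-correction at two rounds
        # Step 2: prompt the model to review its previous generation.
        messages.append({"role": "user", "content":
                         "Review your previous answer and find problems with it."})
        feedback = chat(messages)
        messages.append({"role": "assistant", "content": feedback})
        # Step 3: answer the original question again, given the feedback.
        messages.append({"role": "user", "content":
                         "Based on the problems you found, answer the original question again."})
        answer = chat(messages)
        messages.append({"role": "assistant", "content": answer})
    return answer
```

Note that because the first call already yields the Standard Prompting answer, each round of self-correction adds two further model calls, which is why the paper reports the number of model calls alongside accuracy.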