Teaching Large Language Models to Self-Debug

Authors: Xinyun Chen, Maxwell Lin, Nathanael Schärli, Denny Zhou

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate SELF-DEBUGGING on a variety of models, showing that SELF-DEBUGGING achieves the state-of-the-art performance on different types of code generation tasks. On the Spider benchmark (Yu et al., 2018) for text-to-SQL generation, where there are no unit tests in the problem description, SELF-DEBUGGING with code explanation consistently improves the baseline by 2-3% with different numbers of initial programs, and improves the prediction accuracy on the most complicated SQL queries by 9%. On both TransCoder for code translation (Roziere et al., 2020) and MBPP for text-to-Python generation (Austin et al., 2021), utilizing unit tests along with code explanation boosts the accuracy by up to 12%, and code explanation alone without debugging also consistently improves the code translation performance by 2-3%. (A hedged sketch of this unit-test-driven debugging loop is given after the table.)
Researcher Affiliation | Collaboration | Xinyun Chen (Google DeepMind), Maxwell Lin (UC Berkeley), Nathanael Schärli (Google DeepMind), Denny Zhou (Google DeepMind). Contact: {xinyunchen,schaerli,dennyzhou}@google.com, mxlin@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode blocks or sections explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper states: 'We use publicly accessible large language models for evaluation: API access of GPT models is available at https://openai.com/api, and StarCoder is an open-source LLM (Li et al., 2023b).' This refers to models they *used*, not their own code implementation for SELF-DEBUGGING.
Open Datasets | Yes | We evaluate SELF-DEBUGGING on the development set of the Spider benchmark (Yu et al., 2018)... In our experiments, we use the TransCoder dataset (Roziere et al., 2020)... Specifically, we perform experiments on the test set of MBPP (Austin et al., 2021)...
Dataset Splits | Yes | We evaluate SELF-DEBUGGING on the development set of the Spider benchmark (Yu et al., 2018)... Specifically, we perform experiments on the test set of MBPP (Austin et al., 2021), which contains 500 Python problems with text descriptions, where each problem has 3 unit tests. We follow prior work (Shi et al., 2022; Ni et al., 2023) in including the first unit test in the prompt as part of the problem description, and keeping the remaining 2 unit tests hidden for full evaluation. (A sketch of this visible/hidden unit-test split is given after the table.)
Hardware Specification | No | The paper mentions the use of specific large language models (e.g., 'code-davinci-002', 'gpt-3.5-turbo', 'gpt-4', and 'StarCoder with 15.5B parameters'), but does not specify the underlying hardware (e.g., GPU models, CPU types) on which these models were run for their experiments.
Software Dependencies | No | The paper states: 'API access of GPT models is available at https://openai.com/api, and StarCoder is an open-source LLM (Li et al., 2023b).' This mentions the LLMs used but does not provide specific software dependencies or their version numbers, such as programming languages, libraries, or frameworks used for implementing the SELF-DEBUGGING framework.
Experiment Setup | Yes | For initial code generation, when starting from one program, we perform greedy decoding with temperature τ = 0. When sampling multiple programs for a problem, we set temperature τ = 0.7 and then perform the execution-based selection described in Section 2. ... All experiments for SELF-DEBUGGING use greedy decoding to generate feedback messages and new programs. We set the maximum number of debugging turns to 10, though empirically the successful debugging processes mostly end within 3 turns. We present the full prompts for experiments in the appendix. (A sketch of this sampling and execution-based selection setup is given after the table.)
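
To make the quoted procedure concrete, here is a minimal sketch of a unit-test-driven self-debugging loop, assuming a generic text-completion callable. The helper names (`run_unit_tests`, `self_debug`, `llm_complete`) and the prompt wording are hypothetical; the paper releases no implementation, and only the loop structure (execute against the visible unit tests, feed the failure plus a code explanation back to the model, stop after at most 10 turns) follows the quoted description.

```python
# Minimal, hypothetical sketch of the unit-test-driven SELF-DEBUGGING loop
# described in the report above. `llm_complete` stands in for any completion
# API call; prompt wording and helper names are assumptions, not the authors'
# released code (none is released).

from typing import Callable, List, Tuple

MAX_DEBUG_TURNS = 10  # the paper caps debugging at 10 turns


def run_unit_tests(program: str, tests: List[str]) -> Tuple[bool, str]:
    """Execute a candidate program against visible unit tests; return (passed, feedback)."""
    namespace: dict = {}
    try:
        exec(program, namespace)       # define the candidate function(s)
        for test in tests:
            exec(test, namespace)      # each test is an assert statement
        return True, "All visible unit tests passed."
    except Exception as exc:           # the first failure becomes the feedback message
        return False, f"{type(exc).__name__}: {exc}"


def self_debug(problem: str,
               visible_tests: List[str],
               llm_complete: Callable[[str], str]) -> str:
    """Generate once, then refine with execution feedback plus a self-explanation."""
    program = llm_complete(f"Write a Python function for:\n{problem}")
    for _ in range(MAX_DEBUG_TURNS):
        passed, feedback = run_unit_tests(program, visible_tests)
        if passed:
            break
        # Ask the model to explain its own code, then revise using the test
        # feedback -- mirroring the quoted "unit tests along with code explanation" variant.
        explanation = llm_complete(f"Explain line by line what this code does:\n{program}")
        program = llm_complete(
            f"Problem:\n{problem}\n\nCurrent code:\n{program}\n\n"
            f"Explanation:\n{explanation}\n\nTest feedback:\n{feedback}\n\n"
            "Fix the code so that all tests pass."
        )
    return program
```

Other feedback variants would change only how the feedback string is constructed; the loop structure stays the same.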
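
The MBPP protocol quoted under Dataset Splits (the first unit test appears in the prompt; the other two stay hidden and are used only for the final evaluation) can be illustrated with a small helper. The record fields `text` and `test_list` are assumptions about how an MBPP-style problem might be stored, not a schema taken from the paper.

```python
# Hypothetical illustration of the quoted MBPP unit-test split: 1 visible test
# in the prompt, 2 hidden tests reserved for evaluation. Field names are assumed.

from typing import Dict, List, Tuple


def split_mbpp_tests(problem: Dict[str, object]) -> Tuple[str, List[str], List[str]]:
    """Return (prompt, visible_tests, hidden_tests) for one MBPP-style record."""
    tests: List[str] = list(problem["test_list"])   # assumed field: 3 assert statements
    visible, hidden = tests[:1], tests[1:]          # first test visible, remaining 2 hidden
    prompt = (
        f"{problem['text']}\n"                      # assumed field: task description
        f"Your code should pass this test:\n{visible[0]}"
    )
    return prompt, visible, hidden
```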
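
For the setup quoted under Experiment Setup, the sketch below shows one plausible reading of the sampling configuration: greedy decoding (temperature 0) when generating a single program, temperature 0.7 when sampling several, followed by execution-based selection. Majority vote over execution outputs is only one common realization of such selection; whether it matches the paper's Section 2 procedure exactly is an assumption, and `sample_program` and `execute` are hypothetical callables.

```python
# Hypothetical sketch of the quoted sampling setup: greedy decoding for a single
# program, temperature 0.7 when sampling several, then execution-based selection.
# Majority vote over execution outputs is an assumed, common realization of that
# selection step, not necessarily the paper's exact rule.

from collections import Counter
from typing import Callable, List


def generate_and_select(problem: str,
                        sample_program: Callable[[str, float], str],
                        execute: Callable[[str], str],
                        n_samples: int = 1) -> str:
    """Pick an initial program before any self-debugging turns are run."""
    if n_samples == 1:
        return sample_program(problem, 0.0)           # greedy decoding, temperature 0
    candidates: List[str] = [sample_program(problem, 0.7) for _ in range(n_samples)]
    outputs = [execute(prog) for prog in candidates]  # run every candidate
    majority_output, _ = Counter(outputs).most_common(1)[0]
    # Return a candidate whose execution output matches the most frequent one.
    return next(p for p, o in zip(candidates, outputs) if o == majority_output)
```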