Large Language Models of Code Fail at Completing Code with Potential Bugs
Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Via our empirical studies, we find that the presence of potential bugs drastically degrades the code-completion performance of high-performing Code-LLMs, with test-case pass rates dropping to below 5% across both datasets for all tested model variants. |
| Researcher Affiliation | Collaboration | Tuan Dinh (1), Jinman Zhao (2), Samson Tan (2), Renato Negrinho (2), Leonard Lausen (2), Sheng Zha (2), George Karypis (2); (1) University of Wisconsin-Madison, (2) Amazon Web Services |
| Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code and datasets are available at https://github.com/amazon-science/buggy-code-completion |
| Open Datasets | Yes | To conduct a quantitative study of bCC, we construct two datasets. First, the buggy-HumanEval dataset contains interview-style coding problems from the HumanEval dataset [10]... Second, the buggy-FixEval dataset, based on FixEval [19], contains user submissions to coding problems. |
| Dataset Splits | No | The paper evaluates models using pass@k over test cases (see the pass@k sketch after the table) but does not provide explicit training, validation, and test splits for the constructed datasets. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions using the CODEGEN, INCODER, and RealiT models, as well as the Python ast library, but it does not specify exact version numbers for these software components, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Following the best-performing settings reported in the corresponding works [12, 33], we use temperature sampling with temperature = 0.6 for CODEGEN and top-p sampling with p = 0.95 and temperature = 0.2 for INCODER. Based on the reference solutions and computing efficiency, we set the maximum output length between 200 and 600 tokens, varying with the problem set. (A hedged sampling-configuration sketch follows the table.) |
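
The pass@k metric referenced in the Dataset Splits row is the standard unbiased estimator introduced with HumanEval (Chen et al., 2021). The sketch below is provided for clarity only; it is not taken from the paper's released code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    computed as a running product for numerical stability.

    n: total completions sampled for a problem
    c: completions that pass all test cases
    k: evaluation budget
    """
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Per-problem estimate; the dataset-level score is the mean over problems.
print(pass_at_k(n=100, c=3, k=1))
```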
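The sampling settings quoted in the Experiment Setup row can be expressed with the Hugging Face transformers API. The following is a minimal sketch under stated assumptions, not the authors' implementation: the checkpoint name, prompt, and max_new_tokens value are illustrative placeholders chosen to match the reported ranges.

```python
# Assumed setup: publicly released CODEGEN checkpoint via Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-2B-mono"  # illustrative; paper does not pin a checkpoint here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "def has_close_elements(numbers, threshold):\n    "  # partial code to complete

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.6,       # CODEGEN setting reported in the row above
    max_new_tokens=600,    # upper end of the reported 200-600 token limit
    pad_token_id=tokenizer.eos_token_id,
)
# For INCODER, the row reports top-p sampling: pass top_p=0.95, temperature=0.2 instead.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```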