Large Language Models of Code Fail at Completing Code with Potential Bugs

Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Via our empirical studies, we find that the presence of potential bugs drastically degrades the code-completion performance of high-performing Code-LLMs, with test-case pass rates dropping to below 5% across both datasets for all tested model variants.
Researcher Affiliation | Collaboration | Tuan Dinh (1), Jinman Zhao (2), Samson Tan (2), Renato Negrinho (2), Leonard Lausen (2), Sheng Zha (2), George Karypis (2); (1) University of Wisconsin-Madison, (2) Amazon Web Services
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper.
Open Source Code | Yes | Code and datasets are available at https://github.com/amazon-science/buggy-code-completion
Open Datasets | Yes | To conduct a quantitative study of bCC, we construct two datasets. First, buggy-HumanEval dataset contains interview-style coding problems from HumanEval dataset [10]... Second, buggy-FixEval dataset, based on FixEval [19], contains user submissions to coding problems (buggy-FixEval).
Dataset Splits | No | The paper mentions evaluating models using pass@k and test cases but does not provide explicit training, validation, and test dataset splits for model training or evaluation within the constructed datasets.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions using CODEGEN, INCODER, and RealiT models, as well as the Python ast library, but it does not specify exact version numbers for these software components, which are necessary for full reproducibility.
Experiment Setup | Yes | Following the best-performing settings reported in the corresponding works [12, 33], we use temperature sampling with temperature = 0.6 for CODEGEN and top-p sampling with p = 0.95 and temperature = 0.2 for INCODER. Based on the reference solutions and computing efficiency, we set the maximum length limit for outputs from 200 to 600 tokens, varying with the problem sets.
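The Open Datasets row above describes buggy-HumanEval, whose coding problems pair a partial program containing a potential bug with the original HumanEval tests. As a purely illustrative sketch of injecting such a bug into otherwise correct partial code via a single semantics-altering operator flip (the swap table, class, and function names below are our assumptions, not the paper's construction procedure; ast.unparse requires Python 3.9+):

    import ast

    # Illustrative operator swaps; the paper's actual set of
    # semantics-altering changes may differ.
    _SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add,
              ast.Lt: ast.GtE, ast.Gt: ast.LtE}

    class OperatorFlipper(ast.NodeTransformer):
        """Flip the first eligible binary or comparison operator found."""
        def __init__(self):
            self.done = False

        def visit_BinOp(self, node):
            self.generic_visit(node)
            if not self.done and type(node.op) in _SWAPS:
                node.op = _SWAPS[type(node.op)]()
                self.done = True
            return node

        def visit_Compare(self, node):
            self.generic_visit(node)
            if not self.done and node.ops and type(node.ops[0]) in _SWAPS:
                node.ops[0] = _SWAPS[type(node.ops[0])]()
                self.done = True
            return node

    def inject_potential_bug(partial_code: str) -> str:
        """Return a buggy variant of a parsable code prefix, or the
        original string if no eligible operator is found."""
        tree = ast.parse(partial_code)
        flipper = OperatorFlipper()
        tree = flipper.visit(tree)
        return ast.unparse(tree) if flipper.done else partial_code

    print(inject_potential_bug("def f(a, b):\n    return a + b"))
    # def f(a, b):
    #     return a - b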
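The pass@k metric mentioned in the Dataset Splits row is conventionally computed with the unbiased estimator from the original HumanEval evaluation: draw n completions per problem, let c be the number that pass all test cases, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A small sketch of that estimator:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k samples
        drawn (without replacement) from n generations passes, given
        that c of the n generations pass the test cases."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 100 samples per problem, 3 of which pass all tests.
    print(round(pass_at_k(n=100, c=3, k=1), 4))   # 0.03
    print(round(pass_at_k(n=100, c=3, k=10), 4))  # 0.2735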
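The decoding settings quoted in the Experiment Setup row map directly onto standard Hugging Face transformers generation arguments. A minimal sketch, assuming the CODEGEN-2B-mono checkpoint and a 300-token cap (the paper varies the cap between 200 and 600 tokens across problem sets, and reports top-p = 0.95 with temperature = 0.2 for INCODER instead):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Checkpoint choice is an assumption; the paper evaluates several
    # CODEGEN and INCODER variants.
    ckpt = "Salesforce/codegen-2B-mono"
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)

    # A buggy partial program serves as the completion prompt.
    prompt = "def add(a, b):\n    return a - "
    inputs = tok(prompt, return_tensors="pt")

    # Temperature sampling with T = 0.6, as reported for CODEGEN.
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.6,
        max_new_tokens=300,
        pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0], skip_special_tokens=True))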