Large Language Models of Code Fail at Completing Code with Potential Bugs

Authors: Tuan Dinh, Jinman Zhao, Samson Tan, Renato Negrinho, Leonard Lausen, Sheng Zha, George Karypis

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Via our empirical studies, we find that the presence of potential bugs drastically degrades the code-completion performance of high-performing Code-LLMs, with test-case pass rates dropping to below 5% across both datasets for all tested model variants.
Researcher Affiliation | Collaboration | Tuan Dinh (1), Jinman Zhao (2), Samson Tan (2), Renato Negrinho (2), Leonard Lausen (2), Sheng Zha (2), George Karypis (2); (1) University of Wisconsin-Madison, (2) Amazon Web Services
Pseudocode | No | No pseudocode or clearly labeled algorithm blocks were found in the paper.
Open Source Code | Yes | Code and datasets are available at https://github.com/amazon-science/buggy-code-completion
Open Datasets | Yes | To conduct a quantitative study of bCC, we construct two datasets. First, buggy-HumanEval dataset contains interview-style coding problems from HumanEval dataset [10]... Second, buggy-FixEval dataset, based on FixEval [19], contains user submissions to coding problems (buggy-FixEval).
Dataset Splits | No | The paper mentions evaluating models using pass@k and test cases but does not provide explicit training, validation, and test dataset splits for model training or evaluation within the constructed datasets.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments were provided in the paper.
Software Dependencies | No | The paper mentions using CODEGEN, INCODER, and RealiT models, as well as the Python ast library, but it does not specify exact version numbers for these software components, which are necessary for full reproducibility.
Experiment Setup | Yes | Following the best-performing settings reported in the corresponding works [12, 33], we use temperature sampling with temperature = 0.6 for CODEGEN and top-p sampling with p = 0.95 and temperature = 0.2 for INCODER. Based on the reference solutions and computing efficiency, we set the maximum length limit for outputs from 200 to 600 tokens, varying with the problem sets.
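The Open Datasets row above describes buggy-HumanEval, whose coding problems pair a partial program containing a potential bug with the original HumanEval tests. As a purely illustrative sketch of injecting such a bug into otherwise correct partial code via a single semantics-altering operator flip (the swap table, class, and function names below are our assumptions, not the paper's construction procedure; ast.unparse requires Python 3.9+):

    import ast

    # Illustrative operator swaps; the paper's actual set of
    # semantics-altering changes may differ.
    _SWAPS = {ast.Add: ast.Sub, ast.Sub: ast.Add,
              ast.Lt: ast.GtE, ast.Gt: ast.LtE}

    class OperatorFlipper(ast.NodeTransformer):
        """Flip the first eligible binary or comparison operator found."""
        def __init__(self):
            self.done = False

        def visit_BinOp(self, node):
            self.generic_visit(node)
            if not self.done and type(node.op) in _SWAPS:
                node.op = _SWAPS[type(node.op)]()
                self.done = True
            return node

        def visit_Compare(self, node):
            self.generic_visit(node)
            if not self.done and node.ops and type(node.ops[0]) in _SWAPS:
                node.ops[0] = _SWAPS[type(node.ops[0])]()
                self.done = True
            return node

    def inject_potential_bug(partial_code: str) -> str:
        """Return a buggy variant of a parsable code prefix, or the
        original string if no eligible operator is found."""
        tree = ast.parse(partial_code)
        flipper = OperatorFlipper()
        tree = flipper.visit(tree)
        return ast.unparse(tree) if flipper.done else partial_code

    print(inject_potential_bug("def f(a, b):\n    return a + b"))
    # def f(a, b):
    #     return a - b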
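The pass@k metric mentioned in the Dataset Splits row is conventionally computed with the unbiased estimator from the original HumanEval evaluation: draw n completions per problem, let c be the number that pass all test cases, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A small sketch of that estimator:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased pass@k: probability that at least one of k samples
        drawn (without replacement) from n generations passes, given
        that c of the n generations pass the test cases."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Example: 100 samples per problem, 3 of which pass all tests.
    print(round(pass_at_k(n=100, c=3, k=1), 4))   # 0.03
    print(round(pass_at_k(n=100, c=3, k=10), 4))  # 0.2735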
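The decoding settings quoted in the Experiment Setup row map directly onto standard Hugging Face transformers generation arguments. A minimal sketch, assuming the CODEGEN-2B-mono checkpoint and a 300-token cap (the paper varies the cap between 200 and 600 tokens across problem sets, and reports top-p = 0.95 with temperature = 0.2 for INCODER instead):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Checkpoint choice is an assumption; the paper evaluates several
    # CODEGEN and INCODER variants.
    ckpt = "Salesforce/codegen-2B-mono"
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)

    # A buggy partial program serves as the completion prompt.
    prompt = "def add(a, b):\n    return a - "
    inputs = tok(prompt, return_tensors="pt")

    # Temperature sampling with T = 0.6, as reported for CODEGEN.
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.6,
        max_new_tokens=300,
        pad_token_id=tok.eos_token_id,
    )
    print(tok.decode(out[0], skip_special_tokens=True))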