Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain

Authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy.
Researcher Affiliation | Collaboration | Columbia University; IBM Research
Pseudocode | No | The paper describes the IdentityChain framework and its steps in prose and with diagrams (e.g., Figure 1), but it does not include a dedicated pseudocode block or a clearly labeled algorithm section.
Open Source Code | Yes | Our code is available at https://github.com/marcusm117/IdentityChain.
Open Datasets | Yes | We evaluate the self-consistency of Code LLMs on two widely adopted benchmarks: HumanEval and MBPP. HumanEval (Chen et al., 2021)... MBPP (Austin et al., 2021)...
Dataset Splits | Yes | Specifically, we use HumanEvalPlus-Mini-v0.1.6 where each problem has 16.5 test cases on average. ... For more precise evaluations, we use the test split of the sanitized version of MBPP, which contains 257 problems manually verified by Austin et al. (2021).
Hardware Specification | Yes | We run open-source model experiments on NVIDIA RTX A6000 GPUs with CUDA 11.3, cuDNN8-devel, PyTorch 1.12.1, and Python 3.10.9.
Software Dependencies | Yes | We run open-source model experiments on NVIDIA RTX A6000 GPUs with CUDA 11.3, cuDNN8-devel, PyTorch 1.12.1, and Python 3.10.9.
Experiment Setup | Yes | For all models, we use greedy decoding for our main experiment in Section 6.1. ... Therefore, we set the temperature to 0 to minimize the randomness. ... For efficiency, we set the max prompt length to be 1,024 tokens, the max generation length to be 512 tokens, and the inference precision to be FP16. We use one-shot prompting for all the models on both benchmarks...
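The Pseudocode row notes that the paper presents the IdentityChain framework only in prose and diagrams. As a rough conceptual illustration of the idea (alternating NL-to-code and code-to-NL generation while checking that semantics are preserved at each step), the sketch below is an assumption about the chain's structure, not the authors' implementation; the callables `generate_code`, `summarize_code`, and `passes_tests` are hypothetical placeholders, and the real code lives in the linked repository.

```python
from typing import Callable, Optional

def self_consistency_chain(
    generate_code: Callable[[str], str],    # NL -> code (model call); placeholder
    summarize_code: Callable[[str], str],   # code -> NL (model call); placeholder
    passes_tests: Callable[[str], bool],    # run the benchmark test cases; placeholder
    nl_spec: str,                           # the original natural-language specification
    chain_length: int = 5,
) -> Optional[int]:
    """Alternate NL -> code -> NL and check that semantics are preserved.

    Returns the index of the first step whose generated program fails the
    test cases, or None if the whole chain stays self-consistent.
    """
    current_nl = nl_spec
    for step in range(chain_length):
        code = generate_code(current_nl)    # NL-to-code generation
        if not passes_tests(code):          # semantic check via test cases
            return step                     # self-consistency violated at this step
        current_nl = summarize_code(code)   # code-to-NL generation
    return None
```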
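The Software Dependencies and Experiment Setup rows together pin down the decoding configuration for the open-source models. One plausible way to mirror it with Hugging Face transformers is sketched below; the checkpoint name and prompt are placeholders, and the paper's exact loading code may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute any of the open-source Code LLMs under study.
MODEL_NAME = "bigcode/starcoderbase"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,   # FP16 inference precision, as reported
    device_map="auto",
)

prompt = "..."  # one-shot example plus task prompt, per the paper's setup
inputs = tokenizer(
    prompt,
    return_tensors="pt",
    truncation=True,
    max_length=1024,             # max prompt length of 1,024 tokens
).to(model.device)

outputs = model.generate(
    **inputs,
    do_sample=False,             # greedy decoding (temperature effectively 0)
    max_new_tokens=512,          # max generation length of 512 tokens
)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
```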