Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain
Authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. |
| Researcher Affiliation | Collaboration | Columbia University and IBM Research |
| Pseudocode | No | The paper describes the IdentityChain framework and its steps in prose and with diagrams (e.g., Figure 1), but it does not include a dedicated pseudocode block or a clearly labeled algorithm section. (An illustrative sketch of the loop is given below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/marcusm117/IdentityChain. |
| Open Datasets | Yes | We evaluate the self-consistency of Code LLMs on two widely adopted benchmarks: HumanEval and MBPP. HumanEval (Chen et al., 2021)... MBPP (Austin et al., 2021)... |
| Dataset Splits | Yes | Specifically, we use HumanEvalPlus-Mini-v0.1.6 where each problem has 16.5 test cases on average. ... For more precise evaluations, we use the test split of the sanitized version of MBPP, which contains 257 problems manually verified by Austin et al. (2021). |
| Hardware Specification | Yes | We run open-source model experiments on NVIDIA RTX A6000 GPUs with CUDA 11.3, cuDNN8-devel, PyTorch 1.12.1, and Python 3.10.9. |
| Software Dependencies | Yes | We run open-source model experiments on NVIDIA RTX A6000 GPUs with CUDA 11.3, cuDNN8-devel, PyTorch 1.12.1, and Python 3.10.9. |
| Experiment Setup | Yes | For all models, we use greedy decoding for our main experiment in Section 6.1. ... Therefore, we set the temperature to 0 to minimize the randomness. ... For efficiency, we set the max prompt length to be 1,024 tokens, the max generation length to be 512 tokens, and the inference precision to be FP16. We use one-shot prompting for all the models on both benchmarks... (A hedged decoding-configuration sketch is given below the table.) |
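
The Pseudocode row notes that the paper describes the IdentityChain loop only in prose and figures. The following is a minimal, hypothetical sketch of that loop as we understand it from the paper (alternate NL-to-code and code-to-NL generations, checking that each regenerated program stays semantically equivalent to the first one); the function names, callable signatures, and default chain length are illustrative placeholders, not the actual API of the marcusm117/IdentityChain repository.

```python
from typing import Any, Callable, Dict, List


def identity_chain(
    nl_to_code: Callable[[str], str],       # placeholder: model call, NL spec -> program
    code_to_nl: Callable[[str], str],       # placeholder: model call, program -> NL spec
    semantically_equivalent: Callable[[str, str], bool],  # placeholder: e.g., same outputs on test inputs
    nl_spec: str,
    chain_length: int = 5,                  # illustrative default, not from the paper
) -> Dict[str, Any]:
    """Alternate NL->code and code->NL generations, checking that semantics are preserved."""
    programs: List[str] = []
    current_nl = nl_spec
    for step in range(chain_length):
        program = nl_to_code(current_nl)    # NL -> PL generation
        programs.append(program)
        # A self-consistent model keeps producing programs equivalent to the step-0 program.
        if step > 0 and not semantically_equivalent(programs[0], program):
            return {"self_consistent": False, "broke_at_step": step, "programs": programs}
        current_nl = code_to_nl(program)    # PL -> NL back-translation
    return {"self_consistent": True, "broke_at_step": None, "programs": programs}
```

In this sketch the equivalence check is deliberately left abstract; the paper operationalizes it with test cases from the benchmarks, so a concrete `semantically_equivalent` would execute both programs on the test inputs and compare outputs.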
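The Experiment Setup row quotes greedy decoding (temperature 0), a 1,024-token prompt cap, a 512-token generation cap, and FP16 inference. Below is a minimal sketch of how such a setup might look for an open-source model using Hugging Face `transformers`; the model name and prompt are placeholders and not taken from the paper, only the decoding settings mirror the quoted configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase"  # placeholder model, not necessarily one evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()  # FP16 inference

prompt = "..."  # one-shot prompt: one solved example followed by the target problem
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)

# Greedy decoding (do_sample=False) corresponds to the temperature-0 setting quoted above.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```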