Beyond Accuracy: Evaluating Self-Consistency of Code Large Language Models with IdentityChain
Authors: Marcus J. Min, Yangruibo Ding, Luca Buratti, Saurabh Pujar, Gail Kaiser, Suman Jana, Baishakhi Ray
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study eleven Code LLMs and show that they fail to preserve self-consistency, which is indeed a distinct aspect from conventional accuracy. |
| Researcher Affiliation | Collaboration | Columbia University and IBM Research |
| Pseudocode | No | The paper describes the IdentityChain framework and its steps in prose and with diagrams (e.g., Figure 1), but it does not include a dedicated pseudocode block or a clearly labeled algorithm section. (An illustrative sketch of the loop is given below the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/marcusm117/IdentityChain. |
| Open Datasets | Yes | We evaluate the self-consistency of Code LLMs on two widely adopted benchmarks: HumanEval and MBPP. HumanEval (Chen et al., 2021)... MBPP (Austin et al., 2021)... |
| Dataset Splits | Yes | Specifically, we use HumanEvalPlus-Mini-v0.1.6 where each problem has 16.5 test cases on average. ... For more precise evaluations, we use the test split of the sanitized version of MBPP, which contains 257 problems manually verified by Austin et al. (2021). |
| Hardware Specification | Yes | We run open-source model experiments on NVIDIA RTX A6000 GPUs with CUDA 11.3, cuDNN8-devel, PyTorch 1.12.1, and Python 3.10.9. |
| Software Dependencies | Yes | We run open-source model experiments on NVIDIA RTX A6000 GPUs with CUDA 11.3, cuDNN8-devel, PyTorch 1.12.1, and Python 3.10.9. |
| Experiment Setup | Yes | For all models, we use greedy decoding for our main experiment in Section 6.1. ... Therefore, we set the temperature to 0 to minimize the randomness. ... For efficiency, we set the max prompt length to be 1,024 tokens, the max generation length to be 512 tokens, and the inference precision to be FP16. We use one-shot prompting for all the models on both benchmarks... (A hedged decoding-configuration sketch is given below the table.) |
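
The Pseudocode row notes that the paper describes the IdentityChain loop only in prose and figures. The following is a minimal, hypothetical sketch of that loop as we understand it from the paper (alternate NL-to-code and code-to-NL generations, checking that each regenerated program stays semantically equivalent to the first one); the function names, callable signatures, and default chain length are illustrative placeholders, not the actual API of the marcusm117/IdentityChain repository.

```python
from typing import Any, Callable, Dict, List


def identity_chain(
    nl_to_code: Callable[[str], str],       # placeholder: model call, NL spec -> program
    code_to_nl: Callable[[str], str],       # placeholder: model call, program -> NL spec
    semantically_equivalent: Callable[[str, str], bool],  # placeholder: e.g., same outputs on test inputs
    nl_spec: str,
    chain_length: int = 5,                  # illustrative default, not from the paper
) -> Dict[str, Any]:
    """Alternate NL->code and code->NL generations, checking that semantics are preserved."""
    programs: List[str] = []
    current_nl = nl_spec
    for step in range(chain_length):
        program = nl_to_code(current_nl)    # NL -> PL generation
        programs.append(program)
        # A self-consistent model keeps producing programs equivalent to the step-0 program.
        if step > 0 and not semantically_equivalent(programs[0], program):
            return {"self_consistent": False, "broke_at_step": step, "programs": programs}
        current_nl = code_to_nl(program)    # PL -> NL back-translation
    return {"self_consistent": True, "broke_at_step": None, "programs": programs}
```

In this sketch the equivalence check is deliberately left abstract; the paper operationalizes it with test cases from the benchmarks, so a concrete `semantically_equivalent` would execute both programs on the test inputs and compare outputs.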
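The Experiment Setup row quotes greedy decoding (temperature 0), a 1,024-token prompt cap, a 512-token generation cap, and FP16 inference. Below is a minimal sketch of how such a setup might look for an open-source model using Hugging Face `transformers`; the model name and prompt are placeholders and not taken from the paper, only the decoding settings mirror the quoted configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigcode/starcoderbase"  # placeholder model, not necessarily one evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda()  # FP16 inference

prompt = "..."  # one-shot prompt: one solved example followed by the target problem
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024).to(model.device)

# Greedy decoding (do_sample=False) corresponds to the temperature-0 setting quoted above.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)
```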