Unsupervised Evaluation of Code LLMs with Round-Trip Correctness

Authors: Miltiadis Allamanis, Sheena Panthaplackel, Pengcheng Yin

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that RTC strongly correlates with existing metrics on narrow-domain benchmarks (HumanEval and ARCADE) measuring the same LLM capability within that narrow domain (Sec. 4.1). We show that RTC allows us to measure an LLM's performance over a wide range of real-life software domains without human-provided annotations and complements existing narrow-domain benchmarks (Sec. 4.2).
Researcher Affiliation | Industry | Google DeepMind. Correspondence to: Miltiadis Allamanis <mallamanis@google.com>.
Pseudocode | No | The paper provides conceptual diagrams and summary tables (e.g., Figure 1, Table 1) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | Our code can be found at https://github.com/google-deepmind/icml2024-roundtrip-correctness.
Open Datasets | Yes | To test this hypothesis we employ HumanEval (Chen et al., 2021) and ARCADE (Yin et al., 2022) that represent two common domains... To investigate this question, we collect a set of 77 permissively licensed open-source Python projects... The used projects are shown in Appendix A.
Dataset Splits | No | The paper mentions drawing samples and using few-shot prompting, but it does not provide specific percentages or counts for training, validation, or test dataset splits, nor does it refer to predefined or stratified splitting methodologies for reproduction.
Hardware Specification | No | The paper mentions 'Given the time constraints and compute limitations,' but it does not specify any hardware details such as GPU models, CPU types, or other computing resources used for the experiments.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as programming language versions or library versions (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | Unless stated otherwise, to compute RTC we draw 3 forward samples and one backward sample per forward sample. We use temperature of 0.8 for the forward model (to allow for disparate forward samples) and 0.1 for the backward samples (to generate high-probability code generations). We use three-shot prompting with identical few-shot prompts for all models. Finally, we limit the length of the forward samples to 128 characters.
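
The experiment setup row describes a concrete sampling procedure, which the following minimal Python sketch makes explicit. The helpers `sample_forward` (code to natural-language description), `sample_backward` (description back to code), and `sim` (a semantic-equivalence score, e.g., a unit-test pass rate) are hypothetical placeholders standing in for model calls, not the authors' released implementation.

```python
def round_trip_correctness(code, sample_forward, sample_backward, sim,
                           n_forward=3, n_backward=1):
    """Average similarity between `code` and code regenerated from
    model-written descriptions of it (higher is better)."""
    scores = []
    for _ in range(n_forward):
        # Forward pass: describe the code in natural language.
        # The paper uses temperature 0.8 to allow disparate forward samples.
        description = sample_forward(code, temperature=0.8)
        for _ in range(n_backward):
            # Backward pass: regenerate code from the description.
            # The paper uses temperature 0.1 for high-probability generations.
            regenerated = sample_backward(description, temperature=0.1)
            # sim() is a placeholder semantic check, e.g., unit-test pass rate.
            scores.append(sim(code, regenerated))
    return sum(scores) / len(scores)
```

With the paper's defaults (3 forward samples, 1 backward sample each), this averages three round-trip scores per code snippet.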