Unsupervised Evaluation of Code LLMs with Round-Trip Correctness
Authors: Miltiadis Allamanis, Sheena Panthaplackel, Pengcheng Yin
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that RTC strongly correlates with existing metrics on narrow-domain benchmarks (HumanEval and ARCADE) measuring the same LLM capability within that narrow domain (Sec. 4.1). We show that RTC allows us to measure an LLM's performance over a wide range of real-life software domains without human-provided annotations and complements existing narrow-domain benchmarks (Sec. 4.2). |
| Researcher Affiliation | Industry | 1Google DeepMind. Correspondence to: Miltiadis Allamanis <mallamanis@google.com>. |
| Pseudocode | No | The paper provides conceptual diagrams and summary tables (e.g., Figure 1, Table 1) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format. |
| Open Source Code | Yes | Our code can be found at https://github.com/google-deepmind/icml2024-roundtrip-correctness. |
| Open Datasets | Yes | To test this hypothesis we employ HumanEval (Chen et al., 2021) and ARCADE (Yin et al., 2022) that represent two common domains... To investigate this question, we collect a set of 77 permissively licensed open-source Python projects... The used projects are shown in Appendix A. |
| Dataset Splits | No | The paper mentions drawing samples and using few-shot prompting, but it does not provide specific percentages or counts for training, validation, or test dataset splits, nor does it refer to predefined or stratified splitting methodologies for reproduction. |
| Hardware Specification | No | The paper mentions 'Given the time constraints and compute limitations,' but it does not specify any hardware details such as GPU models, CPU types, or other computing resources used for the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with version numbers, such as programming language versions or library versions (e.g., 'Python 3.8', 'PyTorch 1.9'). |
| Experiment Setup | Yes | Unless stated otherwise, to compute RTC we draw 3 forward samples and one backward sample per forward sample. We use temperature of 0.8 for the forward model (to allow for disparate forward samples) and 0.1 for the backward samples (to generate high-probability code generations). We use three-shot prompting with identical few-shot prompts for all models. Finally, we limit the length of the forward samples to 128 characters. |
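The sampling scheme quoted above (several high-temperature forward samples, a low-temperature backward sample per forward sample, and a semantic-equivalence check on the regenerated code) can be sketched as follows. This is an illustrative outline only: the function names `forward`, `backward`, and `check` are placeholders for LLM calls and unit-test execution, not the authors' released implementation.

```python
def round_trip_correctness(code, forward, backward, check,
                           n_forward=3, n_backward=1):
    """Estimate RTC for one code snippet.

    forward:  code -> natural-language description (sampled, e.g. T=0.8)
    backward: description -> regenerated code (e.g. T=0.1)
    check:    semantic-equivalence test, e.g. running the snippet's tests
    Returns the fraction of round-trip samples that pass the check.
    """
    passes, total = 0, 0
    for _ in range(n_forward):
        description = forward(code)           # one forward sample
        for _ in range(n_backward):
            candidate = backward(description)  # one backward sample
            passes += int(check(code, candidate))
            total += 1
    return passes / total


# Toy demonstration with stub "models": the backward stub exactly inverts
# the forward stub, so every round trip passes and the RTC estimate is 1.0.
forward_stub = lambda code: "describe: " + code
backward_stub = lambda desc: desc[len("describe: "):]
same_text = lambda original, candidate: original == candidate
print(round_trip_correctness("def add(a, b): return a + b",
                             forward_stub, backward_stub, same_text))
```

In the paper's actual setup the equivalence check is semantic (e.g. test execution) rather than textual equality, and the forward/backward calls are few-shot-prompted LLM invocations.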